Sir:



I, Randy Berka, do hereby state and declare that 

1. I received a Ph.D. in Microbiology and Immunology from the University of Colorado, 
Health Science Center, Denver, Colorado, in 1983. I have been employed at Novozymes 
Biotech, Inc., Davis, California, since 1992 where I am currently a Research Fellow. 

2. I have read the Office Action dated January 13, 2004 and the Advisory Action dated July 
12, 2004 issued in connection with the above-referenced patent application and understand 
that claims 100, 102-104, 109-111, 114-115, and 120-124 are rejected under 35 U.S.C. § 112, 
first paragraph, for lack of written description, and claims 100, 102-104, 109-111, 114-115, 117 
and 119-124 are rejected under 35 U.S.C. § 112, first paragraph, for lack of enablement. I 
respectfully disagree with these conclusions. 

3. The Office states that the specification does not adequately describe suitable structural, 
physical and chemical characteristics of the claimed nucleic acids or the enzyme they encode to 
distinguish them from nucleic acid sequences which are not claimed because the recited 
structural relationships are arbitrary since neither the specification nor the prior art discloses 
any definitive relationship between protein function and % identity or homology at either the 
nucleotide or amino acid level. I respectfully disagree with the Office's statement. 

The recited structural relationships of (1) percent identity of the amino acid sequences 
encoded by the genes, (2) percent homology of the nucleic acid sequences of the genes, and 
(3) nucleic acid hybridizations under defined stringent conditions to identify complementary 
strands of genes encoding the same or similar enzyme or protein function are far from
arbitrary. The structural relationships are based on very conservative, logical and rational
scientific deductions that are supported by detailed statistical analyses reported in the scientific 
literature. The claimed relationships are based on the work of Chothia and Lesk (Chothia and
Lesk, 1986, EMBO J. 5: 823-826) and Bork (Bork et al., 1994, Curr. Opin. Struct. Biol. 4: 393-403;
Bork et al., 1998, J. Mol. Biol. 283: 707-725), who established a solid relationship between
amino acid sequence identity and structural identity of homologous proteins. These 
investigators found that structural divergence was an exponential function of sequence 
divergence, expressed in terms of the fraction of residues that differ between sequences. The 
reliability of structural annotation transferred by homology, therefore, depends on the sequence 
identity of the homologous proteins (Chothia and Lesk, 1986). Since the collective structural 
properties of proteins (circumscribed by their primary structure) are responsible for their 
biological activities, proteins that share a high degree of amino acid sequence identity are 
known with reasonable certainty to possess the same biochemical/biological activities. This 
deduction is based on the following observations from the literature. 

First, there are literally hundreds of reports in which investigators have used nucleic
acid probes from one species to clone genes encoding a homologous protein/enzyme from a
heterologous source. This approach has been repeatedly employed by us to clone several
genes from diverse fungi including a laccase gene (Berka et al., 1997, Appl. Environ. Microbiol.
63: 3151-3157), a phytase gene (Berka et al., 1998, Appl. Environ. Microbiol. 64: 4423-4427), a
mutanase gene (Fuglsang et al., 2000, J. Biol. Chem. 275: 2009-2018), and a 5-aminolevulinate
synthase gene (Elrod et al., 2000, Curr. Genet. 38: 291-298).

Second, homologues of a newly sequenced gene product can be identified via database
searches using BLAST, Smith-Waterman, and other computer algorithms, and structure and
function can then be assigned to the gene product (Bork et al., 1998). This is based on the
concept that sequence similarity implies structural and functional similarity.

Third, with the exponential growth of sequence databases and protein structure 
databases over the last 20 years, relationships between sequence similarity and functional 
similarity have emerged, and particular thresholds of sequence conservation and functional 
similarity have become increasingly apparent. It is clear that functional classification is 
conserved over a range of sequence similarity and biological/biochemical function diverges only 
when sequence similarities are low enough that they have no statistical significance. In the 
present case, a conservative sequence identity threshold (90%) was chosen at which all 
homologues in a BLAST search of the public databases gave only functionally identical enzymes. 
I provide seven examples in which BLASTP was used to query publicly available sequence
databases (Protein Information Resource, PIR-NREF, GeneseqP) with different fungal enzymes
(amylase, acid protease, glucoamylase, exocellobiohydrolase, endoglucanase, phytase, and
lipase) to determine whether homologous proteins with at least 90% identity possessed the same
biological/biochemical activity. The results summarized in Appendix 1 show clearly and
convincingly that proteins with at least 90% sequence identity are annotated to have the same
activity. These deductions are further supported by Wilson and colleagues (Wilson et al., 2000,
J. Mol. Biol. 297: 233-249), who established a clear relationship between sequence similarity and
functional similarity. Wilson et al. found that functional identity is conserved down to
approximately 40% amino acid sequence identity, and that among proteins that share 50-100%
sequence identity, function is conserved in almost all. This observation extends the previous
observations of Chothia and Lesk (1986), who compared 32 pairs of homologous proteins and
found that in pairs whose sequence identity is greater than 50%, at least 90% of the
residues lie in structurally common cores. It is noteworthy that Wilson et al. also found that
percent identity is more effective at quantifying functional conservation than probabilistic scores
(P-values, ^scores). Additionally, Chothia and Lesk (1986) note that a protein provides a close
structural model for other homologous proteins with which its sequence is at least 50%
identical.
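
The threshold test described in this paragraph can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure actually used to prepare Appendix 1: the `Hit` record, the `classify_hits` helper, and the keyword matching on annotation text are hypothetical, and the four sample hits are taken from section A of Appendix 1.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    annotation: str  # description line of the database hit
    identity: int    # percent identity reported by BLASTP


def classify_hits(hits, keywords, threshold):
    """Split hits at or above `threshold` into agreeing, unannotated, conflicting.

    Hypothetical helper: a hit "agrees" if its annotation contains any keyword.
    """
    agreeing, unannotated, conflicting = [], [], []
    for hit in hits:
        if hit.identity < threshold:
            continue  # below the identity cut-off, not considered
        text = hit.annotation.lower()
        if any(keyword in text for keyword in keywords):
            agreeing.append(hit)
        elif "hypothetical" in text or "unnamed" in text:
            unannotated.append(hit)
        else:
            conflicting.append(hit)
    return agreeing, unannotated, conflicting


# Sample rows from Appendix 1, section A (glucoamylase query).
sample_hits = [
    Hit("Glucoamylase I precursor (EC 3.2.1.3)", 100),
    Hit("glucan 1,4-alpha-glucosidase (EC 3.2.1.3)", 98),
    Hit("Glucoamylase precursor (EC 3.2.1.3)", 66),
    Hit("hypothetical protein MG01096.4", 53),
]
keywords = ("glucoamylase", "glucan 1,4-alpha-glucosidase")

agree_90, unknown_90, conflict_90 = classify_hits(sample_hits, keywords, 90)
agree_50, unknown_50, conflict_50 = classify_hits(sample_hits, keywords, 50)
```

With these sample rows, no hit at either threshold carries a conflicting annotation; the only hit that does not match a keyword is the one labelled hypothetical.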

The essential feature of the claimed invention is isolated nucleic acids that hybridize to 
nucleotides 568 to 2045 of SEQ ID NO: 1 under the specified stringency conditions and have a 
specified percent homology and encode polypeptides with a specific function, i.e., 
phospholipase B activity. It is well known that hybridization techniques using a known DNA as 
a probe under specified stringency conditions were conventional in the art at the time of filing. 
The claim is drawn to nucleic acids all of which must hybridize with nucleotides 568 to 2045 of 
SEQ ID NO: 1 and must encode a protein with phospholipase B activity. 

The logic of the patent examiner's argument that claimed synthetic sequences of 80-90%
sequence identity are unusable as probes is flawed, since the claimed proteins derived
from the resulting genes must have phospholipase B activity. The fact that two sequences
that share 80% identity to SEQ ID NO: 1 may have as little as 40% sequence identity to each 
other is irrelevant as long as the gene they detect shares 80-90% identity with SEQ ID NO: 1 
and the corresponding gene product exhibits phospholipase B activity. 

A person skilled in the art would not expect substantial variation among species
encompassed within the scope of the claims, because the specified hybridization and percent
identity conditions set forth in the claims yield structurally similar DNAs and proteins. There are
hundreds of papers where genes coding for proteins of similar function were cloned on the
basis of hybridization to heterologous probes under a variety of stringency conditions (e.g.,
Berka et al., 1997, Appl. Environ. Microbiol. 63: 3151-3157; Berka et al., 1998, Appl. Environ.
Microbiol. 64: 4423-4427; Fuglsang et al., 2000, J. Biol. Chem. 275: 2009-2018; Elrod et al.,
2000, Curr. Genet. 38: 291-298; Kraus et al., 1989, Proc. Natl. Acad. Sci. USA 86: 9193-9197;
Sweitzer et al., 1995, J. Biol. Chem. 270: 16510-16513; Kraus and Aaronson, 1991, Methods
Enzymol. 200: 546-556; Kraus and Aaronson, U.S. Patent No. 6,639,060). In such cases, the
structurally similar proteins have a high degree of sequence identity. Such cross-hybridization 
is to a significant extent predictive of gene relatedness, and gene relatedness is in turn 
predictive of functional similarity. Furthermore, as noted above, from the publications of 
Chothia and Lesk (1986), Bork (1994 and 1998), and Wilson et al. (2000) it is virtually certain 
that proteins with a high degree of amino acid sequence identity (>50%) have the same 
biological/biochemical function. 

Each of the claimed structural features (percent identity, percent homology, and 
hybridization) specifies, therefore, a genus of structurally- and functionally-related enzymes 
having phospholipase B activity. 

4. The Office also states that while one of skill in the art can readily envision numerable 
species of nucleic acid sequences that are at least 90% identical to the recited reference 
nucleotide sequence and that encode a polypeptide that is at least 90% identical to the recited 
reference amino acid sequence, one cannot envision which of these also encode a polypeptide 
with phospholipase B activity.

As I have described above in paragraph 3, the work of Wilson and colleagues (Wilson et
al., 2000) demonstrates convincingly that proteins which share 50-100% sequence identity 
consistently and reliably represent molecules with the same biochemical/biological function. 

5. The Office also states that the specification does not provide any information on what 
amino acid residues are necessary and sufficient for phospholipase B activity or what amino 
acid sequence modifications, e.g., insertions, deletions and substitutions, would be permissible 
in a phospholipase B polypeptide that would improve or at least would not interfere with the 
biological activity or structural features necessary for the biological activity and stability of the 
protein. Moreover, the Office argues that since there are no other examples of a phospholipase 
B known that have structural homology with SEQ ID NO: 2, it is not possible to even guess at 
the amino acid residues which are critical to its structure or function based on sequence 
conservation. I disagree with this assertion. 

John Maynard Smith proposed more than 30 years ago (Smith, 1970, Nature 225: 563-564)
that the occurrence of functional mutant proteins that differ from wild-type must be frequent
for evolution to be possible. Since then, numerous evolutionary and mutagenesis studies have
supported the assertion that proteins are highly plastic in tolerating amino acid changes 
(Creighton, 1993, Proteins (Freeman, New York); Bowie et al., 1990, Science 247: 1306-1310). 

Guo et al., 2004, Proc. Natl. Acad. Sci. USA 101: 9205-9210, observed that various
residues of a protein are differentially sensitive to substitutions, and that tolerance of the entire
protein to random change can be characterized by a probabilistic relationship termed the
"x-factor." The x-factor is broadly defined as the probability that a random amino acid
replacement will lead to functional inactivation. Moreover, they determined the x-factor to be
34% ± 6%. Contrary to the Office's contention that random (even conservative) changes in a
protein in the absence of structural information would adversely affect folding and/or activity,
the findings of Guo et al. (2004) indicate that proteins are generally tolerant
to random amino acid substitutions, and that the probability of destroying protein function is
surprisingly small. Furthermore, Chothia and Lesk (1986) note that the structure of the active
site domains may be highly conserved among homologous proteins even when overall amino
acid sequence identities are low.

Markiewicz et al., 1994, J. Mol. Biol. 240: 421-433, examined 12 or 13 different amino
acid substitutions at each residue across 90% of the 360 amino acid E. coli lac repressor
protein. Reanalysis of their data by Guo et al. (2004) revealed an x-factor value of 34%, which
is identical to the value for random inactivation of human 3-methyladenine DNA glycosylase
studied by Guo et al. Axe et al., 1998, Biochem. 37: 7157-7166, found that 95% of randomly
introduced single amino acid substitutions did not lead to inactivated ribonuclease enzyme. 
Rennell et al., 1991, J. Mol. Biol. 222: 67-88, found that approximately 84% of amino acid 
substitutions in T4 lysozyme did not cause inactivation. 
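
The x-factor lends itself to a short calculation. The sketch below assumes, purely for illustration, that each random replacement inactivates the protein independently with probability x, so that the fraction of variants retaining function after n replacements is (1 - x)^n. The function name `fraction_active` is hypothetical, and the independence assumption is a simplification introduced here rather than a statement taken from the cited studies.

```python
def fraction_active(x_factor: float, n_substitutions: int) -> float:
    """Fraction of variants expected to stay functional after n random
    replacements, ASSUMING each replacement inactivates independently
    with probability x_factor (illustrative simplification)."""
    if not 0.0 <= x_factor <= 1.0:
        raise ValueError("x_factor must lie between 0 and 1")
    if n_substitutions < 0:
        raise ValueError("n_substitutions must be non-negative")
    return (1.0 - x_factor) ** n_substitutions


X_FACTOR_GUO = 0.34  # value reported by Guo et al. (2004), as cited above

# Expected surviving fraction for one, two and three random replacements.
survival = {n: fraction_active(X_FACTOR_GUO, n) for n in (1, 2, 3)}
```

Under this assumption roughly two thirds of single-replacement variants remain functional, which is the complement of the 34% x-factor quoted above.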

The phospholipase B enzyme described in the present application is novel and 
represents the first member of a new "genus". However, the enzyme harbors sequence and 
structural motifs that are well known for enzymes of the phosphoesterase class (which contains 
not only phospholipase B, but also some phospholipase C, and other phosphomonoesterase 
enzymes). Using a standard software tool (HMMPFAM) that is well known in the art
(Sonnhammer et al., 1998, Nucleic Acids Research 26: 320-322), a profile of conserved amino
acid motifs can be generated representing highly conserved sequences in this family. As noted 
by Guo et al. (2004), such highly conserved segments may be critical for enzyme activity or 
biological function, and they are expected to be less tolerant for substitutions (see Appendix 2 
for an example of this analysis). 

A skilled person with such information could easily prepare a variant of SEQ ID NO: 2 
containing a deletion, insertion, and/or substitution of one or more amino acid residues. The 
specification on page 7, lines 8-30, provides that conservative amino acid substitutions can be
made that do not significantly affect the folding and/or activity of SEQ ID NO: 2 and provides
examples of conservative substitutions within the group of basic amino acids, acidic amino
acids, polar amino acids, hydrophobic amino acids, aromatic amino acids, and small amino 
acids. The specification on page 13, line 9, to page 14, line 4, also describes methods for 
identifying amino acid residues essential to the activity of a phospholipase B. 
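
The notion of a conservative substitution can be made concrete with a small lookup. The sketch below is illustrative only: the six group names follow the paragraph above, but the membership of each group is an assumption based on a commonly used grouping and is not reproduced from page 7 of the specification; `CONSERVATIVE_GROUPS` and `is_conservative` are hypothetical names.

```python
# ASSUMED group memberships (one-letter amino acid codes); the specification's
# own lists are not reproduced here.
CONSERVATIVE_GROUPS = {
    "basic": set("RKH"),
    "acidic": set("DE"),
    "polar": set("QN"),
    "hydrophobic": set("LIV"),
    "aromatic": set("FWY"),
    "small": set("GASTM"),
}


def is_conservative(original: str, replacement: str) -> bool:
    """Return True if both residues fall within the same assumed group."""
    a, b = original.upper(), replacement.upper()
    if a == b:
        return False  # identical residues are not a substitution
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS.values())


lys_to_arg = is_conservative("K", "R")  # both in the assumed basic group
asp_to_trp = is_conservative("D", "W")  # acidic versus aromatic
```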

The Office is, therefore, incorrect in stating that it is not possible to even guess at the 
amino acid residues which are critical to its structure or function based on sequence 
conservation. 

6. The Office states that Example 2 "does not demonstrate using a nucleic acid of SEQ ID 
NO: 1 to isolate a claimed nucleic acid from a different source, nor does the specification 
identify a source, from which one would be able to isolate a claimed nucleic acid, other than A.
oryzae. More importantly, the claims are not limited to sequences obtainable from a natural
source, and the example does not teach how to make a nucleic acid readable on the claims that 
cannot be found in nature and encodes a different amino acid sequence than SEQ ID NO: 2." 

Preliminarily, the specification contains an extensive disclosure of techniques which are 
well known in the art and indeed routine for persons of ordinary skill for identifying other 
nucleotides of the present invention. The specification describes methods for preparing and
probing DNA libraries (Examples 1-2); for isolating nucleic acids encoding the phospholipases
(Example 3); for determining cross-hybridization of the nucleic acids encoding phospholipases
using (i) nucleotides 568 to 2045 of SEQ ID NO: 1, (ii) the cDNA sequence contained in
nucleotides 568 to 2045 of SEQ ID NO: 1, or (iii) a complementary strand of (i) or (ii) (page 5,
line 1, to page 7, line 7); for comparing the percent identity of the deduced amino acid
sequences of the phospholipases to amino acids 20 to 464 of SEQ ID NO: 2 using the Clustal
method according to Higgins, 1989, CABIOS 5: 151-153 (Example 4); for determining the
degree of homology between two nucleic acid sequences using the Wilbur-Lipman method
according to Wilbur and Lipman, 1983, Proceedings of the National Academy of Sciences USA
80: 726-730 (page 12, line 29, to page 13, line 8); for producing the phospholipases (Example
5); and for purifying the phospholipases and characterizing the properties of the encoded
phospholipases (Examples 6-9). A skilled person could easily isolate and identify the claimed
nucleic acid sequences using Applicants' disclosure.
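
Percent identity between two sequences that have already been aligned reduces to a column count. The sketch below is a toy illustration, not the Clustal or Wilbur-Lipman method named above: it performs no alignment itself, the two aligned strings are hypothetical, and the choice of denominator (every alignment column except those where both sequences have a gap) is an assumption, since conventions differ between programs.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of a pre-computed pairwise alignment.

    Gaps are written as '-'. Columns where both sequences have a gap are
    ignored; a residue opposite a gap counts as a mismatch (assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # uninformative column
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Hypothetical aligned fragments: five identical columns out of seven.
example_identity = percent_identity("MKT-LLA", "MKSALLA")
```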

The Office indicates that the specification does not demonstrate the isolation of a 
claimed nucleic acid from a different source using SEQ ID NO: 1 or SEQ ID NO: 2, nor does the 
specification identify a source, from which one would be able to isolate a claimed nucleic acid, 
other than A. oryzae. I conducted a BLASTP search of several publicly available protein
databases (NR, PIRNREF, GENESEQP, SWALL) using SEQ ID NO: 2 as the query sequence to
determine whether SEQ ID NO: 2 could be used to identify homologues that encode a
phospholipase subsequent to the filing date of the instant application. The results of the search
revealed four proteins having phospholipase activity from Aspergillus niger, Oryza sativa, and
Burkholderia pseudomallei.

(1) Accession no. ADF82794: Aspergillus niger phospholipase PLP03,
Expectancy = 1e-139, Identities = 429/444 (54%).

(2) Accession no. NR 52076602: phospholipase-like protein [Oryza sativa
(japonica cultivar-group)], Expectancy = 2e-44, Identities = 143/414 (34%).

(3) Accession no. NR 50919526: putative phospholipase [Oryza sativa],
Expectancy = 4e-40, Identities = 136/417 (32%).

(4) Accession no. NR 53719489: putative phospholipase [Burkholderia
pseudomallei K96243], Expectancy = 1e-36, Identities = 130/418 (31%).

These results clearly demonstrate the ability to use SEQ ID NO: 2 to identify proteins 
having phospholipase activity from other sources. A skilled person would be able to isolate the
genes encoding these phospholipases using Applicants' specification.
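
The "Identities" figures in the hit list are a ratio of identical positions to aligned positions expressed as a percentage. The sketch below recomputes that percentage for hits (2) to (4), assuming the percentage is truncated rather than rounded; that truncation rule is an assumption that happens to reproduce the three listed values, and `blast_percent` is a hypothetical helper name.

```python
def blast_percent(identical: int, aligned: int) -> int:
    """Whole-number percent identity, truncated toward zero (assumed rule)."""
    if aligned <= 0 or not 0 <= identical <= aligned:
        raise ValueError("need 0 <= identical <= aligned and aligned > 0")
    return (100 * identical) // aligned


# (identical positions, aligned positions) as listed for hits (2) to (4).
listed_hits = {
    "NR 52076602": (143, 414),
    "NR 50919526": (136, 417),
    "NR 53719489": (130, 418),
}
percents = {acc: blast_percent(i, n) for acc, (i, n) in listed_hits.items()}
```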

With regard to man-made variants, a skilled person could easily prepare a variant of 
SEQ ID NO: 2 containing a deletion, insertion, and/or substitution of one or more amino acid 
residues. The specification on page 7, lines 8-30, provides that conservative amino acid
substitutions can be made that do not significantly affect the folding and/or activity of SEQ ID
NO: 2 and provides examples of conservative substitutions within the group of basic amino acids,
acidic amino acids, polar amino acids, hydrophobic amino acids, aromatic amino acids, and 
small amino acids. Amino acid substitutions that do not generally alter the specific activity are 
described, for example, by H. Neurath and R.L. Hill, 1979, In, The Proteins, Academic Press, 
New York. The specification on page 13, line 9, to page 14, line 4, further describes methods 
for identifying amino acid residues essential to the activity of a phospholipase B. In fact, the 
findings of Guo et al. (2004) support the contention that proteins are generally tolerant to 
amino acid substitutions, and the probability of destroying protein function is surprisingly small. 

7. The Office argues that "it is not routine in the art to screen for multiple substitutions or 
multiple modifications, as encompassed by the instant claims ..." I disagree with this statement. 
As of October 1999, a skilled person was able to routinely produce thousands of mutants of SEQ 
ID NO: 1 through mutagenesis and other techniques and screen the mutants in a short period of 
time without undue experimentation. See, for example, Christians et al., 1999, Nature
Biotechnology 17: 259-264; Zocher et al., 1999, Analytica Chimica Acta 391: 345-351; Rieger et
al., 1999, Yeast 15: 973-986; Genome Analysis, A Laboratory Manual, Volume 3, Cloning Systems,
Robotic Replication pp. 20-22, Cold Spring Harbor Laboratory Press, 1997; Kell, 1999, Trends in
Biotechnology 17: 89-91; Dove, 1999, Nature Biotechnology 17: 859-863; Armstrong et al.,
1998, Journal of Biomolecular Screening 3: 271-275; Eickhoff et al., 1999, BioMethods 10: 17-30;
and Stevens et al., 1998, Journal of Biomolecular Screening 3: 305-311. In addition, the
specification describes on page 13, line 9, to page 14, line 4, how to identify essential amino acids
in the sequence of SEQ ID NO: 2. A skilled person can, therefore, predict with reasonable
statistical accuracy which modifications, if any, would result in a loss of the desired activity/utility.

8. The undersigned declarant declares further that all statements made herein of her own
knowledge are true and that all statements made on information and belief are believed to be 
true and further that these statements are made with the knowledge that willful false 
statements and the like so made are punishable by fine or imprisonment, or both, under Section 
1001 of Title 18 of the United States Code and that such willful false statements may jeopardize 
any patent issuing thereon.

APPENDIX 1



Demonstration that proteins which share 50-100% amino acid sequence identity are annotated
to have the same biochemical/biological function.

A. Query sequence = Aspergillus niger glucoamylase (glucan 1,4-alpha-glucosidase
(EC 3.2.1.3)). The following BLASTP hits with at least 50% sequence identity are all 
annotated as glucoamylase enzymes (except those noted as hypothetical or unnamed 
products): 



Abstract                                                              E-value  % Identity
pirnref|NF00073574| Glucoamylase I precursor (EC 3.2.1.3) (Gluca...   0.0      100
pirnref|NF00889958| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) pr...   0.0      100
pirnref|NF00626751| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) pr...   0.0      98
pirnref|NF00626853| Glucoamylase precursor (EC 3.2.1.3) (Glucan ...   0.0      98
pirnref|NF00889947| Glucoamylase precursor (EC 3.2.1.3) [Aspergi...   0.0      97
pirnref|NF01651009| glucoamylase [Aspergillus awamori]                0.0      96
pirnref|NF00626460| Glucoamylase precursor (EC 3.2.1.3) (Glucan ...   0.0      94
pirnref|NF00889975| Glucoamylase precursor (EC 3.2.1.3) (Glucan ...   0.0      94
pirnref|NF01328097| Glucoamylase [Aspergillus niger]                  0.0      93
pirnref|NF00626366| preproglucoamylase G2 [Aspergillus niger]         0.0      93
pirnref|NF00889945| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) G2...   0.0      93
pirnref|NF00889968| Glucoamylase-471 [Aspergillus awamori]            0.0      98
pirnref|NF00889964| Glucoamylase-471 [Aspergillus awamori]            0.0      98
pirnref|NF00889951| Glucoamylase-471 (1,4-Alpha-D-Glucan Glucohy...   0.0      98
pirnref|NF00626575| Glucoamylase precursor (EC 3.2.1.3) (Glucan ...   0.0      66
pirnref|NF00494189| Glucoamylase precursor (EC 3.2.1.3) [Talarom...   0.0      61
pirnref|NF00649388| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) pr...   0.0      55
pirnref|NF00647663| Glucoamylase precursor (EC 3.2.1.3) (Glucan ...   0.0      55
pirnref|NF00648280| glucan 1,4-alpha-glucosidase [Neurospora cra...   0.0      55
pirnref|NF01576653| hypothetical protein MG01096.4 [Magnaporthe ...   0.0      53
pirnref|NF01709909| hypothetical protein FG06278.1 [Gibberella z...   e-173    50




B. Query sequence = Aspergillus niger aspergillopepsin (acid proteinase/aspartyl 
protease/preproproctase). The following BLASTP hits with at least 50% sequence identity 
are all annotated as acid protease/aspartyl protease enzymes (except those noted as 
hypothetical or unnamed products): 



Abstract                                                              E-value  % Identity
pirnref|NF00626537| Aspergillopepsin A precursor (EC 3.4.23.18) ...   0.0      100
pirnref|NF00626722| aspergillopepsin I (EC 3.4.23.18) precursor ...   0.0      99
pirnref|NF00918479| Aspergillopepsin A precursor (EC 3.4.23.18) ...   0.0      99
pirnref|NF00626425| Preproproctase B precursor [Aspergillus niger]    0.0      96
pirnref|NF00889972| Aspergillopepsin A precursor (EC 3.4.23.18) ...   0.0      96
pirnref|NF00626729| Aspergillopepsin [Aspergillus phoenicis]          0.0      99
pirnref|NF00627288| Aspergillopepsin I (EC 3.4.23.18) [Aspergill...   e-163    71
pirnref|NF00626684| Aspergillopepsin A precursor (EC 3.4.23.18) ...   e-155    67
pirnref|NF00626584| aspergillopepsin O [Aspergillus oryzae]           e-155    67
pirnref|NF00626993| Propenicillopepsin-JT2 precursor [Penicilliu...   e-152    67
pirnref|NF00176292| Putative aspartic protease [Emericella nidul...   e-145    65
pirnref|NF00889953| aspergillopepsin I (EC 3.4.23.18) [Aspergill...   e-144    81
pirnref|NF00627293| Aspergillopepsin F precursor (EC 3.4.23.18) ...   e-142    66
pirnref|NF01463777| acid proteinase [Monascus purpureus]              e-141    63
pirnref|NF00627188| Aspartic proteinase [Penicillium roquefortii]     e-138    63
pirnref|NF00626580| Aspartic proteinase IM [Aspergillus oryzae]       e-137    65
pirnref|NF01229506| Aspartic Proteinase [Aspergillus oryzae]          e-129    70
pirnref|NF00627002| Prepropenicillopepsin-JT3 precursor [Penicil...   e-129    58
pirnref|NF00626995| Penicillopepsin (EC 3.4.23.20) (Peptidase A)...   e-127    68
pirnref|NF00626992| penicillopepsin (EC 3.4.23.20) [Penicillium ...   e-127    68
pirnref|NF00747468| Aspartic proteinase precursor (EC 3.4.23.-) ...   e-104    49
pirnref|NF00646517| Endothiapepsin precursor (EC 3.4.23.22) (Asp...   e-103    50
pirnref|NF01576663| hypothetical protein MG02898.4 [Magnaporthe ...   e-103    49
pirnref|NF00646493| Endothiapepsin [Cryphonectria parasitica]         e-102    55



C. Query sequence = Aspergillus oryzae α-amylase (AMY1, Taka-amylase, amyA).
The following BLASTP hits with at least 50% sequence identity are all annotated as α-amylase
enzymes (except those noted as hypothetical or unnamed products):


Abstract                                                              E-value  % Identity
pirnref|NF00626669| Alpha-amylase A precursor (EC 3.2.1.1) (Taka...   0.0      100
pirnref|NF01651008| alpha-amylase [Aspergillus awamori]               0.0      99
pirnref|NF00626583| unnamed protein product [Aspergillus oryzae]      0.0      100
pirnref|NF01544944| alpha-amylase [Aspergillus kawachii]              0.0      100
pirnref|NF00626750| alpha-amylase (EC 3.2.1.1) precursor [Asperg...   0.0      99
pirnref|NF00626854| Alpha-amylase precursor (EC 3.2.1.1) (1,4-al...   0.0      99
pirnref|NF00626612| Taka-amylase A (EC 3.2.1.1) (Alpha-amylase) ...   0.0      99
pirnref|NF00625791| Taka-amylase A (EC 3.2.1.1) (Alpha-amylase) ...   0.0      99
pirnref|NF00626368| alpha-amylase precursor [Aspergillus niger]       0.0      99
pirnref|NF00889948| Alpha-amylase B precursor (EC 3.2.1.1) (1,4-...   0.0      99
pirnref|NF00626646| alpha-amylase (EC 3.2.1.1) precursor [Asperg...   0.0      99
pirnref|NF00626648| Taka-amylase A (Taa-G1) precursor [Aspergill...   0.0      99
pirnref|NF00626351| alpha-amylase precursor [Aspergillus niger]       0.0      99
pirnref|NF00889969| Alpha-amylase A precursor (EC 3.2.1.1) (1,4-...   0.0      99
pirnref|NF00626590| alpha-amylase (EC 3.2.1.1) precursor [Asperg...   0.0      99
pirnref|NF00626638| Taka Amylase [Aspergillus oryzae]                 0.0      100
pirnref|NF00626642| alpha-amylase (EC 3.2.1.1) [Aspergillus oryzae]   0.0      97
pirnref|NF00176034| Alpha-amylase AmyA [Emericella nidulans]          0.0
pirnref|NF00073571| Acid-stable alpha-amylase [Aspergillus kawac...   0.0      67
pirnref|NF01651007| alpha-amylase [Aspergillus awamori]               0.0      68
pirnref|NF00626487| alpha-amylase (EC 3.2.1.1) [Aspergillus niger]    0.0      67
pirnref|[illegible]| Acid alpha-amylase (EC 3.2.1.1) (1,4-alpha-D...  0.0      66
pirnref|NF00176203| Alpha-amylase [Emericella nidulans]               0.0      63
pirnref|NF01752634| alpha-amylase precursor [Lipomyces starkeyi]      0.0      60
pirnref|NF00756572| unnamed protein product [Thermomyces ...          0.0      60
pirnref|[illegible]| Lipomyces kononenkoae subsp. spencermartinsi...  e-180    57
pirnref|NF00186159| Alpha-amylase 1 precursor (EC 3.2.1.1) (1,4-...   e-180    56
pirnref|NF00490302| Alpha-amylase 2 precursor (EC 3.2.1.1) (1,4-...   e-167    56
pirnref|NF00490307| Alpha-amylase 1 precursor (EC 3.2.1.1) (1,4-...   e-159    56
pirnref|NF00490293| alpha-amylase (EC 3.2.1.1) precursor [Debary...   e-158    55
pirnref|NF00490296| alpha-amylase [Debaryomyces occidentalis]         e-158    54
pirnref|NF00155569| alpha-amylase [synthetic construct]               e-153    53



D. Query sequence = Hypocrea jecorina (Trichoderma reesei) exocellobiohydrolase
I (CBH1, exoglucanase, cellobiohydrolases, 1,4-β-glucan cellobiohydrolase). The
following BLASTP hits with at least 50% sequence identity are all annotated as
exocellobiohydrolase enzymes (except those noted as hypothetical or unnamed products):


Abstract                                                              E-value  % Identity
pirnref|NF00769949| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...   0.0      100
pirnref|NF01042178| cellulose 1,4-beta-cellobiosidase (EC 3.2.1....   0.0      100
pirnref|NF00494383| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...   0.0      100
pirnref|NF01470257| cellobiohydrolase I [Trichoderma viride]          0.0      99
pirnref|NF00756631| Cellobiohydrolase I [Trichoderma viride]          0.0      95
pirnref|NF00756635| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...   0.0      94
pirnref|NF00494360| Cellobiohydrolase I [Hypocrea jecorina]           0.0      100
pirnref|NF00494368| 1,4-Beta-D-Glucan Cellobiohydrolase I [Hypoc...   0.0      99
pirnref|NF00494367| 1,4-Beta-D-Glucan Cellobiohydrolase I [Hypoc...   0.0      99
pirnref|NF00494366| 1,4-Beta-D-Glucan Cellobiohydrolase I [Hypoc...   0.0      99
pirnref|NF01524434| Exocellobiohydrolase I [Hypocrea jecorina]        0.0      99
pirnref|NF00494322| 1,4-Beta-D-Glucan Cellobiohydrolase Cel7 [Hy...   0.0      98
pirnref|NF01524433| Exocellobiohydrolase I [Hypocrea jecorina]        0.0      97
pirnref|NF00756590| Cellobiohydrolase [Hypocrea lixii]                0.0      [illegible]
pirnref|NF01265187| unnamed protein product [Acremonium ...           0.0      63
pirnref|[illegible]| unnamed protein product [illegible]              0.0      62
pirnref|NF00835152| Xylanase/cellobiohydrolase precursor (EC 3.2...   0.0      62
pirnref|[illegible]| unnamed protein product [Exidia glandulosa]      0.0      61
pirnref|NF00626501| 1,4-beta-D-glucan cellobiohydrolase B precur...   0.0      59
pirnref|NF01266275| unnamed protein product [Chaetomium ...           0.0      58
pirnref|[illegible]| hypothetical protein [Neurospora crassa]         0.0      59
pirnref|NF01258404| unnamed protein product [Scytalidium ...          0.0      57
pirnref|[illegible]| 1,4-beta-D-glucan-cellobiohydrolyase (EC 3.2...  0.0      58
pirnref|[illegible]| Cellulase (EC 3.2.1.91) [Humicola grisea]        0.0      57
pirnref|[illegible]| unnamed protein product [Thermoascus auranti...  0.0      65
pirnref|NF00756321| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...   0.0      57
pirnref|NF00801194| Cellulase CEL7A [Lentinula edodes]                0.0      60
pirnref|NF01257476| unnamed protein product [Thielavia australie...   0.0      56
pirnref|NF00625663| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...   0.0      58
pirnref|NF00962307| Cellobiohydrolase I [Thermoascus aurantiacus]     e-180    64
pirnref|[illegible]| Cellobiohydrolase [Thermoascus aurantiacus]      e-179    64
pirnref|NF00959024| Cellobiohydrolase I catalytic domain (EC 3.2...   e-179    64
pirnref|NF01709668| GUXC_FUSOX Putative exoglucanase type C prec...   e-179    57
pirnref|NF01053514| Cellobiohydrolase C [Aspergillus oryzae]          e-179    63
pirnref|NF00755639| Putative exoglucanase type C precursor (EC 3...   e-179    57
pirnref|NF00626413| 1,4-beta-D-glucan cellobiohydrolase A precur...   e-177    65
pirnref|NF00627000| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...   e-177    57
pirnref|NF01696049| cellobiohydrolase C [Gibberella zeae]             e-176    57
pirnref|[illegible]| unnamed protein product [Exidia glandulosa]      e-176    63
pirnref|[illegible]| Exoglucanase I precursor (EC 3.2.1.91) (Exoc...  e-174    62
pirnref|NF01581876| hypothetical protein MG06834.4 [Magnaporthe ...   e-174    62
pirnref|NF00648798| Exoglucanase 1 precursor (EC 3.2.1.91) (Exoc...   e-170    56
pirnref|NF01265188| unnamed protein product [Trichophaea saccata]     e-169    60
pirnref|NF00992462| 1,4-beta-D-glucan-cellobiohydrolyase (EC 3.2...   e-168    60
pirnref|NF01053511| Cellobiohydrolase D [Aspergillus oryzae]          e-163    60
pirnref|NF00731508| cellulose 1,4-beta-cellobiosidase (EC 3.2.1....   e-163    54
pirnref|NF00731784| Cellulase precursor [Irpex lacteus]               e-162    52


pirnreflNF00733334l Exoalucanase Drecursor (EC 3.2.1.911 (Exocel... 


e-162 


! '.54; 


[pirnref 1 N F00731509 1 cellulose 1.4-beta-cellobiosidase (EC 3.2.1.... 


e : 161 


54 


pirnreflNF00731785l Exocellulase Drecursor riroex lacteusl 


e-160 


53 


E. Query sequence = Hypocrea jecorina (Trichoderma reesei) endoglucanase I
(EG1, endo-1,4-β-glucanase, 1,4-β-glucan glucanhydrolase). The following BLASTP
hits with at least 50% sequence identity are all annotated as endoglucanase enzymes (except
those noted as hypothetical or unnamed products):


Abstract                                                                E-value   % Identity

pirnref|NF00494331| Endoglucanase EG-1 precursor (EC 3.2.1.4) (E...     0.0       100
pirnref|NF01407727| Endoglucanase I [Trichoderma viride]                0.0       99
pirnref|NF00756647| Endoglucanase EG-1 precursor (EC 3.2.1.4) (E...     0.0       94
pirnref|NF00756639| Endoglucanase I [Trichoderma viride]                0.0       93
pirnref|NF00154649| ENDO II [synthetic construct]                       0.0       97
pirnref|NF00494347| Endoglucanase I [Hypocrea jecorina]                 0.0       100
pirnref|NF00793302| unnamed protein product [Talaromyces emersonii]     e-121     55
pirnref|NF00626671| Endo-1,4-beta-glucanase (EC 3.2.1.4) [Asperg...     e-107     51
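
The 50% identity cutoff used throughout these listings can be applied mechanically to a hit list. The following is only a minimal sketch, assuming the hits have already been transcribed into (accession, description, E-value, % identity) records; the names `Hit`, `parse_evalue` and `filter_hits` are hypothetical and are not part of BLASTP or of the analysis reported here. It also shows how the E-value shorthand in the tables (for example "e-121") can be read as a number.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One row of a BLASTP hit table (illustrative structure only)."""
    accession: str
    description: str
    evalue: float
    identity: int  # percent identity to the query


def parse_evalue(text: str) -> float:
    """Convert E-value shorthand such as 'e-121', '5e-97' or '0.0' to a float."""
    text = text.strip().lower()
    if text.startswith("e"):
        # 'e-121' is shorthand for 1e-121
        text = "1" + text
    return float(text)


def filter_hits(hits: List[Hit], min_identity: int = 50) -> List[Hit]:
    """Keep only hits whose percent identity is at least min_identity."""
    return [h for h in hits if h.identity >= min_identity]


# Three rows transcribed from the endoglucanase I listing above.
hits = [
    Hit("NF00494331", "Endoglucanase EG-1 precursor", parse_evalue("0.0"), 100),
    Hit("NF00793302", "unnamed protein product", parse_evalue("e-121"), 55),
    Hit("NF00626671", "Endo-1,4-beta-glucanase", parse_evalue("e-107"), 51),
]

for hit in filter_hits(hits):
    print(hit.accession, hit.evalue, hit.identity)
```

Raising `min_identity` narrows the list in the obvious way; the annotation check itself (whether each retained hit is described as the same enzyme) still has to be read from the description column.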


F. Query sequence = Coprinus cinereus laccase (polyphenoloxidase, bilirubin 
oxidase, multicopper oxidase). The following BLASTP hits with at least 50% sequence 
identity are all annotated as laccase enzymes (except those noted as hypothetical or unnamed 
products): 


Abstract                                                                E-value   % Identity

pirnref|NF00733435| Laccase 2 (EC 1.10.3.2) [Coprinopsis cinerea]       0.0       100
pirnref|NF00733482| Laccase 3 (EC 1.10.3.2) [Coprinopsis cinerea]       0.0       78
pirnref|NF01638355| laccase 3 [Coprinopsis cinerea]                     0.0       78
pirnref|NF01386173| Laccase 4 (EC 1.10.3.2) [Pleurotus sajor-caju]      0.0       64
pirnref|NF00731916| Laccase 2 precursor (EC 1.10.3.2) (Benzenedi...     0.0       64
pirnref|NF01386176| Laccase 5 (EC 1.10.3.2) [Pleurotus sajor-caju]      0.0       66
pirnref|NF00731901| Bilirubin oxidase (Laccase) [Pleurotus ostre...     0.0       64
pirnref|NF01567552| laccase [Pleurotus ostreatus]                       0.0       64
pirnref|NF01386172| Laccase 2 (EC 1.10.3.2) [Pleurotus sajor-caju]      0.0       67
pirnref|NF01461741| laccase [Rigidoporus microporus]                    0.0       65
pirnref|NF01461740| laccase [Rigidoporus microporus]                    0.0       65
pirnref|NF01386175| Laccase 1 (EC 1.10.3.2) [Pleurotus sajor-caju]      0.0       62
pirnref|NF00731925| Laccase 1 precursor (EC 1.10.3.2) (Benzenedi...     0.0       62
pirnref|NF01637249| laccase [Pleurotus ostreatus] (Pleurotus pul...     0.0       62
pirnref|NF00427931| Laccase precursor [Funalia trogii]                  0.0       63
pirnref|NF00758114| Laccase precursor (EC 1.10.3.2) [basidiomyce...     0.0       63
pirnref|NF00939119| Polyphenoloxidase (EC 1.10.3.2) (Laccase 1) ...     0.0       63
pirnref|NF00232919| Polyphenoloxidase (EC 1.10.3.2) (Laccase 1) ...     0.0       63
pirnref|NF01689789| laccase [Pleurotus ostreatus]                       0.0       62
pirnref|NF00993008| Laccase 2 (EC 1.10.3.2) [Trametes pubescens]        0.0       64
pirnref|NF00732955| Laccase (EC 1.10.3.2) [Schizophyllum commune]       0.0       63
pirnref|NF00731988| Laccase precursor (EC 1.10.3.2) (Benzenediol...     0.0       63
pirnref|NF00731957| laccase (EC 1.10.3.2) A [Trametes versicolor]       0.0       63
pirnref|NF00731968| ligninolytic phenoloxidase (EC 1.10.-.-) 2 d...     0.0       63
pirnref|NF01057965| Laccase 2 [Trametes versicolor]                     0.0       64
pirnref|NF00044343| Laccase 2 precursor (EC 1.10.3.2) (Benzenedi...     0.0       64
pirnref|NF00731635| Laccase precursor (EC 1.10.3.2) (Benzenediol...     0.0       63
pirnref|NF00059532| Laccase precursor (EC 1.10.3.2) [Coriolus ve...     0.0       63
pirnref|NF00731977| laccase I [Trametes versicolor]                     0.0       64
pirnref|NF00788162| Laccase [Pycnoporus coccineus]                      0.0       63
pirnref|NF00731989| ligninolytic phenoloxidase [Trametes hirsuta]       0.0       63
pirnref|NF00788163| Laccase precursor [Pycnoporus coccineus]            0.0       63
pirnref|NF00945391| unnamed protein product [unidentified]              0.0       64
pirnref|NF00909761| Laccase (Laccase 1) [Lentinula edodes]              0.0       64
pirnref|NF00044346| Laccase 1 precursor (EC 1.10.3.2) (Benzenedi...     0.0       63
pirnref|NF00731959| Laccase 2 precursor (EC 1.10.3.2) (Benzenedi...     0.0       64
pirnref|NF00964375| Laccase III (EC 1.10.3.2) [Trametes versicolor]     0.0       63
pirnref|NF00731973| Laccase precursor (EC 1.10.3.2) [Trametes ve...     0.0       63
pirnref|NF00801188| Laccase B precursor (EC 1.10.3.2) [Trametes ...     0.0       63
pirnref|NF01012946| Laccase [Trametes versicolor]                       0.0       64
pirnref|NF00050826| Laccase LCC3-1 (EC 1.10.3.1) [Polyporus cili...     0.0       63
pirnref|NF00427925| Laccase (EC 1.10.3.2) [Coriolopsis gallica]         0.0       63
pirnref|NF01470431| laccase [Trametes sp. I-62]                         0.0       63
pirnref|NF01470433| laccase [Trametes sp. I-62]                         0.0       63
pirnref|NF00466979| Phenoloxidase (EC 1.10.3.2) [Trametes sp. I-62]     0.0       62
pirnref|NF01470432| laccase [Trametes sp. I-62]                         0.0       62
pirnref|NF00466977| Phenoloxidase (EC 1.10.3.2) [Trametes sp. I-62]     0.0       62
pirnref|NF01470430| laccase [Trametes sp. I-62]                         0.0       62
pirnref|NF00801189| Laccase 1 (EC 1.10.3.2) [Trametes versicolor]       0.0       63


G. Query sequence = Thermomyces lanuginosus (Humicola lanuginosa) lipase. The
following BLASTP hits with at least 50% sequence identity are all annotated as lipase enzymes
(except those noted as hypothetical or unnamed products):


Abstract                                                                E-value   % Identity

pirnref|NF00756570| Lipase precursor (EC 3.1.1.3) (Triacylglycer...     e-171     100
pirnref|NF00756566| LIPASE [Thermomyces lanuginosus]                    e-159     100
pirnref|NF00756565| Lipase (E.C. 3.1.1.3) (Triacylglycerol Acylh...     e-159     100
pirnref|NF01188178| Lipase [Thermomyces lanuginosus]                    e-158     99
pirnref|NF01102488| unnamed protein product [Talaromyces thermop...     e-153     88
pirnref|NF01111633| unnamed protein product [Thermomyces ibadane...     e-136     78
pirnref|NF01114192| unnamed protein product [Talaromyces emersonii]     5e-97     61
pirnref|NF01105706| unnamed protein product [Talaromyces byssoch...     2e-89     57
pirnref|NF00626823| unnamed protein product [Aspergillus tubinge...     8e-77     50
pirnref|NF00158307| unnamed protein product [unidentified]              1e-76     50
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APPENDIX 2. HMMPFAM Analysis of Aspergillus oryzae phospholipase 




hmmpfam - search a single seq against HMM database 
HMMER 2.1.1 (Dec 1998) 

Copyright (C) 1992-1998 Washington University School of Medicine 

HMMER is freely distributed under the GNU General Public License (GPL) . 

HMM file : /usr/novo/databases/online/bf_biocof e/pf am/Pf am.pf 1 

Sequence file: 

/usr/novo/projects/biocofe/tmp/20041118_230717J7095 . seq 
Query: unknown, 464 bases, A11C7610 checksum. 

Scores for sequence family classification (score includes all domains) : 
Model Description Score E-value N 



Phosphoesterase Phosphoesterase family 198.0 1.5e-55 1 

Parsed for domains : 

Model Domain seq-f seq-t hmm-f hmm-t score E-value 



Phosphoesterase 1/1 51 447 .. 1 526 [] 198.0 1.5e-55 

Alignments of top-scoring domains: 

Phosphoesterase: domain 1 of 1, from 51 to 447: score 198.0, E = 1.5e-55 

*->ieHvVilmqENRSFDhyfGtls . .gvrgeidavse . esnpl . f sDpn 
+e++V 1+ ENRSFD+++G +++g++++i++ ++ n + sDp+ 
unknown, 51 VENI VWLI LENRS FDNI LGGVRrqGLDNPINN - - GpFCNYKnASDPS 95 



unknown , 



slkiqfgkpvwesqwqggwdpdtgasfqalenqPFrf ndttegkPllag 
s k + +s + +d+ +s + ++ + ++g +++g 

96 SGKYCTQAKDYDSVF NDPDHSVTGNNLEFYGTYTPNNG-AIASG 13 8 



unknown , 



f rvqdlnHswydphsawngGr . nDrWlgadaettakavsgpqvMgyf krs 
+v +d+ ++ ++ nD+ + a+ + qvMgy++++ 

13 9 KWADQS GFLNAQ1NDY PKLAPEEATRQVMGYYTEE 174 



unknown , 



dipvyLwaLAdeFtlcDnyFcsvpGpTqPNRlyllsGtspfyDdSAKhvS 
++p + +L+deFt ++ +F+ vpGpT+PNRl 1+Gt + 
175 EVPTL-VDLVDEFTTFNSWFSCVPGPTNPNRLCALAGTA 



212 



unknown , 



. VegmdgdGtlkiandsPasaDGPKFvDGlTPDfYavntmnPpYqpssvps 
g G+ 

213 AGHGK 



217 



unknown , 



knGgpvlanpsiqkpplqgfnwsTipdrLdekGvsWgiYqeklpgtlyqg 
n++* +1 g + + i++ ekGvsW +Y ++ +g +++ 

218 -NDDDFLN YGI SSKS I FEAANEKGVS WLNY - DGTNGEFEPD 256 



klgf nqyvqyf kqnanplnYwkeysnshaplrladsrklkavrkf hiydl 
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unknown , 



+l+f 
257 SLFF 



+ +++ +++ +++ 
TYVNQTSRSNWPV 274 



unknown , 



unknown , 



ssFkkDvkngkLPqVSf iiPRYf DlllnganDeYmHPghdviaaGdkwik 
++F++D+ g LP+ S+i P + +n++ HP+ +v +G++++k 

275 ENFFQDAYLGVLPKFSYINP SCCGTNTNSM- -HPTGNV-SYGEVFVK 318 

evleaLlanpqvWnkt HivtYDEngGf yDHVppPvapvpnp . glvtvsd 
++++a+++ pq W ktll++tYDE gGfyDHVppP a +p+ ++ 
319 QIYDAIRQGPQ-WDKTLLFITYDETGGFYDHVPPPLAVRPDN1TYT 3 63 

idav . pGpgpf nif gf yGLGpRVPtlvISPwPskgGtvdhepnGPtpss . 
++++ g + f ++LG+R Pt VISP+ sk+G++ ++ G p ++ 
3 64 -ETAkNGQKYTL- -HFDRLGGRMPTWVISPY-SKKGYIEQY- -GTDPVTg 4 07 

. . . tf dHtSvLaf iekrFgLpsLpnisawrdavagdltst<-* 
++ ++ tSvL+++ +++++ + +a+ ++ + t 
unknown, 4 08 kpaPYSATSVLKTLGYLWDIEDFTPRVAHSPSFDHLIGTT 447 



unknown , 






The EMBO Journal vol.5 no.4 pp.823-826, 1986



The relation between the divergence of sequence and structure in proteins




Cyrus Chothia 1 and Arthur M.Lesk 2 
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Communicated by M.F.Perutz 

Homologous proteins have regions which retain the same general fold
and regions where the folds differ. For pairs of distantly related
proteins (residue identity ~20%), the regions with the same fold may
comprise less than half of each molecule. The regions with the same
general fold differ in structure by amounts that increase as the amino
acid sequences diverge. The root mean square deviation in the positions
of the main chain atoms, Δ (in Å), is related to the fraction of mutated
residues, H, by the expression: $\Delta = 0.40\,e^{1.87H}$.
Key words: evolution/protein homology/model building



Introduction 

The comparative analysis of the structures of related proteins can
reveal the effects of the amino acid sequence changes that have
occurred during evolution (Perutz et al., 1965). Previous work
on individual protein families has shown that mutations, insertions
and deletions produce changes in three-dimensional structure
(Almassy and Dickerson, 1978; Lesk and Chothia, 1980, 1982,
1986; Greer, 1981; Chothia and Lesk, 1982, 1984; Read et al.,
1984). Here we report a systematic comparison of structures from
eight different protein families. This shows that the extent of the
structural changes is directly related to the extent of the sequence
changes.

In the work reported here we used the atomic coordinates of
25 proteins (Table I). All these structures have been determined
at high resolution (1.4-2.0 Å) and refined. The errors in their
co-ordinates are 0.15-0.20 Å (see references given in Table I).
The 25 proteins represent eight different protein families and
provide 32 pairs of homologous structures.

Methods and Results 

The conserved structural cores and the variable regions of
homologous proteins

The structures of homologous proteins can be divided into those
regions in which the general fold of the polypeptide chains is
very similar and those where it is quite different. In comparing
protein structures it is useful to separate the parts that have similar
folds from those where the folds differ. We did this using the
following quantitative procedure: (i) the main-chain atoms of
major elements of secondary structure (helices or two adjacent
strands of β-sheet) were individually superposed; and (ii) each
superposition was then extended to include additional atoms at
both ends. The extension was continued as long as the deviations
in the positions of the atoms in the last residue included were
no greater than 3 Å. This procedure defined the segments that




[Figure 1; horizontal axis: Sequence Identity (%)]

Fig. 1. Size of common cores as a function of protein homology. If two
proteins of length n1 and n2 have c residues in the common core, the
fractions of each sequence in the common core are c/n1 and c/n2. We plot
these values, connected by a bar, against the residue identity of the core
(see Table II).

Fig. 2. The relation of residue identity and the r.m.s. deviation of the
backbone atoms of the common cores of 32 pairs of homologous proteins
(see Table II).
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Table I. Homologous proteins determined at high resolution 

Family                          Protein                    Abbreviation  Resolution (Å)  R factor (%)  Reference

Globins (deoxy)                 Human α subunit            HHBα          1.74            16            Fermi et al., 1984
                                Human β subunit            HHBβ          1.74            16            Fermi et al., 1984
                                Sperm whale myoglobin      1MBD          1.40            14            Phillips, 1980
                                Erythrocruorin             1ECD          1.40            18            Steigemann and Weber, 1979

Cytochromes                     Tuna c                     3CYT          1.50            17            Takano and Dickerson, 1982
                                Rice embryo c              1CCR          1.50            19            Ochi et al., 1983
                                Bacterial c2               3C2C          1.68            17            Bhatia, 1981
                                Bacterial c551             351C          1.60            19            Matsuura et al., 1982

Serine protease                 Bovine γ-chymotrypsin      2GCH          1.90            18            Cohen et al., 1981
                                Bovine trypsin             3PTP          1.50            16            Chambers and Stroud, 1979
                                S. griseus protease A      2SGA          1.80            14            Sielecki et al., 1979
                                S. griseus protease B      3SGB          1.80            14            Read et al., 1983

Dihydrofolate reductase         L. casei                   3DFR          1.70            15            Bolin et al., 1982
                                E. coli                    4DFR          1.70            17            Bolin et al., 1982

Cu-electron transport proteins  Bacterial azurin           1AZA          2.00            19            Norris et al., 1983
                                Poplar leaf plastocyanin   1PCY          1.60            17            Guss and Freeman, 1983

Sulphydryl protease             Papaya papain              PAP           1.65            16            Kamphuis et al., 1985
                                Kiwifruit actinidin        2ACT          1.70            17            Baker, 1980

Lysozyme                        Human                      1LZ1          1.50            18            Artymiuk and Blake, 1981
                                Hen egg white              LZHE          1.60                          Grace, 1979

Immunoglobulin domains          Vλ (RHE)                   2RHE          1.60            15            Furey et al., 1983
                                Vλ (KOL)                   KLVL          1.90            19            Marquart et al., 1980
                                Vγ (KOL)                   KLVH          1.90            19            Marquart et al., 1980
                                Cλ (KOL)                   KLCL          1.90            19            Marquart et al., 1980
                                Cγ1 (KOL)                  KLCH          1.90            19            Marquart et al., 1980

Except for hen egg lysozyme and papain, atomic coordinates were obtained from the Protein Data Bank (Bernstein et al., 1977).



have the same fold in both proteins. They include major elements
of secondary structure and peptides that form the active site. We
call the collection of such regions the 'common core'. The residues
outside the common core are in peripheral elements of
secondary structure, in the loops between major elements of
secondary structure, at the ends of helices, or in strands at the
edges of β-sheets (Lesk and Chothia, 1980, 1982; Greer, 1981;
Chothia and Lesk, 1982, 1984; Read et al., 1984).
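
The extension step of the procedure described above can be illustrated with a short sketch. This is a simplification made under stated assumptions, not the authors' program: it takes per-residue deviations (in Å) that are assumed to come from a single prior superposition of a secondary-structure element, whereas the procedure in the text extends the superposition itself. The function name `extend_segment` and the example numbers are hypothetical.

```python
from typing import List, Tuple


def extend_segment(deviations: List[float], start: int, end: int,
                   cutoff: float = 3.0) -> Tuple[int, int]:
    """Grow the inclusive residue range [start, end] at both ends for as long
    as the next residue's deviation is no greater than `cutoff` (in Å)."""
    last = len(deviations) - 1
    # Extend towards the N-terminal side.
    while start > 0 and deviations[start - 1] <= cutoff:
        start -= 1
    # Extend towards the C-terminal side.
    while end < last and deviations[end + 1] <= cutoff:
        end += 1
    return start, end


# Hypothetical per-residue deviations; the seed element covers residues 2..4.
example = [5.1, 2.8, 1.0, 0.6, 0.9, 2.9, 3.4, 1.2]
print(extend_segment(example, 2, 4))
```

In this toy input the segment grows by one residue on each side and stops where the deviation first exceeds 3 Å; residue 7 is not reached even though its own deviation is small, because the extension is contiguous.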

The results of comparing the 32 pairs of homologous proteins
are given in Table II. Pairs whose sequence identity is >50%
have 90% or more of the residues of the individual structures
within the common cores. Pairs whose residue identity drops to
about 20% have common cores that contain between 42% and
98% of the residues of individual structures (Table II, Figure
1). Proteins built of β-sheets are at the bottom of this range and
proteins built of α-helices are at the top. Compared with helical
proteins, β-sheet proteins contain proportionally fewer residues
within secondary structures and more in loops, the regions
particularly susceptible to local refolding when sequences change.

Structural divergence in the common cores of homologous
proteins

Although the core regions retain a common fold, they do undergo
structural change as their sequences diverge. Mutations at the
interfaces between secondary structures produce changes in the
geometry of packing and, in the case of β-sheets, local
changes in backbone conformation (Lesk and Chothia, 1980,
1982, 1986; Chothia and Lesk, 1982, 1984; Read et al., 1984).
The overall extent of the structural divergence of two homologous
proteins can be measured by optimally superposing the common
cores and calculating the root mean square difference in the
positions of their main-chain atoms, Δ. For the 32 homologous pairs
of proteins in Table II the values of Δ vary between 0.62 and
2.31 Å (Table II).
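
For readers who want the quantity Δ spelled out, the sketch below computes a root mean square difference between two equal-length lists of atomic positions. It assumes the two coordinate sets have already been optimally superposed and that atoms are paired by index; the superposition itself is not shown. The function name `rms_difference` and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rms_difference(atoms_a: Sequence[Point], atoms_b: Sequence[Point]) -> float:
    """Root mean square distance (same units as the input, e.g. Å) between
    paired atoms of two structures that are already superposed."""
    if len(atoms_a) != len(atoms_b) or len(atoms_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(atoms_a, atoms_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(atoms_a))


# Two toy atoms, each displaced by exactly 1 Å between the structures.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, -1.0)]
print(rms_difference(a, b))
```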

The exact value of Δ is, of course, dependent upon the procedure
used to define the common cores of homologous proteins.
Inspection of the regions not in the common cores shows that
they usually have very different conformations. This is especially
true of the larger loops. Thus modification of the procedure used
here to define the common cores would only produce marginal
differences.

Essentially similar results are obtained if, in place of a core
derived for each individual homologous pair, we use a core common
to all members of a family. For example, in the cytochromes
c (rice), c (tuna), c2 and c551, a 48-residue core is common to all
four structures (Chothia and Lesk, 1984). Superpositions of this
core in the four structures give the Δ values listed in Table III.
Compared with the Δ values for individual core comparisons,
these Δ values are somewhat smaller for closely related pairs
(in these cases the family core is smaller and more homologous
than the pair core), but nearly equal for distantly related pairs
(Table III).

The contribution to Δ from experimental error and from differences
in molecular environment can be estimated from the
comparison of proteins whose structures have been accurately
determined in different crystal forms, or in crystals that have
more than one molecule in the asymmetric unit. The values of
Δ for five such proteins are between 0.25 and 0.40 Å (Table
II). The mean is 0.33 Å: one half to one seventh of the Δ
values reported here for homologous proteins.
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"Table II.-- Cbmiiiori 'cores of homologous proteins: 
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' i'V s*' » <'" J * it ; J " ; " ■'i* 

i '"■ . ,r j * " •- , ; ' 




Family . ;r : '*".. 


•Protein pair^-"'-"— - 

. .. ■ * » 


ixCSiQUCo in pruiein pair - 


rvesiuues in : core 


r.m.s. uinerence 


Percentage f of core '-residues" *„-'. V- 










in core (A) 


that are the same in both stmctures 


Globin 


. HHBa:HHBjS 


141:140 






A A 

44 




HHBa:lMBD 


1/11.1 C2 

141: ID j 


1 "20 

ijy 


1.4j 


2/ 




HHBa:lECD 


l j1:1jo 


1x2 


2.25 


1j 


- 


HHB0:1MBD 


140: Ijj 


1/11 
14J 


I.jU 


oc 
2j 




HHBjS:lECD 


1 /(<. 1 1£ 

14o:1jo 


121 


Oil 
2.11 


on 
2U 




1MBD:1ECD 


153*136 


132 


1 67 


21 


Cytochrome c             3CYT:1CCR    103:111    103    0.62    59
                         3CYT:3C2C    103:112    99     1.13    37
                         3CYT:351C    103:82     57     1.65    25
                         1CCR:3C2C    111:112    101    1.47    41
                         1CCR:351C    111:82     58     1.86    24
                         3C2C:351C    112:82     56     1.50    36



Serine protease 


2GCH:3PTP 


2J0.222 




n qq 


4/ 


2GCH:2SGA 


1*2 <c. i o i 
2Jo:lol 


1 14 


1 AO 


OC 

2j 




2GCH:3SGB 


2jo:1oj 


1 1 £ 

1 10 & 


2. 14 


oo 
22 


■ 


3PTP:2SGA . 


11 1 . 1 0 1 

22 1 : 1 o 1 . 


112 


1 QA 

1.04 


OO 
22 




3PTP:3SGB 


Ill . 1 oc 

222: 185 


1 lit 
110 • 


i niC 
2.U0 


1 Q 

iy 




2SGA:3SGB 


181 185 


172 


0 89 


65 


Immunoglobulin domain 


2RHE:KLVL 


i i A.i in 

1 iU.l lv 


ins 


ft 8ft 


. ft 




2RHE:KLVH 


i i rt. 1 k 
1 1U:1Zj 


' Q"2 
OJ 


1 

l.OJ 


jU 




2RHE:KLCL 


1 lA.lftl 

11U:1U1 


JJ 


i <n 


lo 




2RHE:KLCH 


1 in. on 

l iu:yy 


4o 


1 /IT 
1 .4/ 


1 J 




KLVL:KLVH 


i in. iic* 

110:123 




1 £ 1 
1 .01 


1f\ 
J\J 




KLVL:KLCL 


i m. i m 
1 1U:1U1 




1.J0 


oo 
22 




VT \/T -VI r*u 


i in.no 
1 Iv.yy 


J 2 


1.j4 


1*7 
1 / 




KLVHiKLCL 


1 1 n. i ni 
11U:1U1 


jy 


1 CO 

1.52 


1/1 
14 




KLVH:KLCH 


i i n.no 
1 W.yy 


J 2 


1 Q< 
l.iJJ 


1U 




KLCL:KLCH 


101*99 


83 


1 36 


35 


Dihydrofolate reductase 




159:161


143 


1.29 


29 


Lysozyme 


1LZ1:LZHE 


130:129


128 


0.70 


61 


Plastocyanin/azurin 


1PCY:1AZA


99:129 


55 


2.31 


18 


Papain/actinidin 


PAP:2ACT 


212:218 


206 


0.77 


49 



Proteins whose structure has been determined in different environments

Protein                    Residues in     Residues   r.m.s. difference   Reference
                           protein pair    in core    in core (Å)

Trypsin inhibitor          58:58           56         0.40                Wlodawer et al., 1984
Tuna cytochrome c          103:103         103        0.30                Takano and Dickerson, 1981
Azurin                     129:129         127        0.37                Norris et al., 1983
Rat protease               224:224         224        0.25                Anderson et al., 1978
Deoxy human haemoglobin    287:287         287        0.30                Fermi et al., 1984

a See Table I for abbreviations.
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The relationship between the divergence of sequence and struc- 
ture in the common cores of homologous proteins 

The divergence of structure as measured by Δ is a simple function
of the fractional sequence identity of the cores (Figure 2). A least
squares fit to the data in Table II gives the relationship:

$\Delta = 0.40\,e^{1.87H}$

where Δ is measured in Å and H is the fraction of mutated
residues. For the 32 pairs of homologous structures in Table II,
the values of Δ predicted by this equation are within 20% of the
observed values for 23 pairs and within 28% for the other nine.
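
The fitted relationship is easy to evaluate directly. The sketch below is an illustration only: it takes the residue identity of the common core as a percentage, converts it to the fraction of mutated residues H, and returns the Δ predicted by the expression above. The function name `predicted_delta` is hypothetical, and the output is the fitted curve, not a measurement.

```python
import math


def predicted_delta(identity_percent: float) -> float:
    """Predicted r.m.s. deviation (Å) of core main-chain atoms from the
    fitted expression delta = 0.40 * exp(1.87 * H), where H is the
    fraction of mutated residues (1 - identity)."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be between 0 and 100 percent")
    h = 1.0 - identity_percent / 100.0
    return 0.40 * math.exp(1.87 * h)


for identity in (100, 80, 50, 20):
    print(identity, round(predicted_delta(identity), 2))
```

At 100% identity the expression reduces to 0.40 Å, which is comparable to the differences the article reports for the same protein determined in different environments; at 50% identity it gives roughly 1 Å.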

The exponential form of the relationship arises because proteins
accept mutations of surface residues more readily than mutations
of buried residues. Closely related proteins differ primarily in
surface residues, whereas distantly related proteins differ in both
surface and buried residues (Table IV). The mutation of residues
buried in the interior usually produces larger structural changes
than the mutation of surface residues. Thus the tendency for
changes in buried residues to lag behind surface changes results
in an exponential relationship between sequential and structural
change.

Conclusions 

In a previous series of papers we have described the structural
differences found in members of individual protein families (Lesk
and Chothia, 1980, 1982; Chothia and Lesk, 1982, 1984). The
differences in the common cores consist mainly of changes in
the relative position and orientation of packed secondary structures
and, in the case of β-sheets, some local changes in structure.
We have shown here that the overall extent of these changes is
directly related to the extent of the sequence differences.

These results imply that the degree of success to be expected
in predicting the structure of a protein from its sequence using
the known structure of an homologous protein, depends upon
the extent of the sequence identity (Lesk and Chothia, 1986).
A protein structure will provide a close general model for other
proteins with which its sequence homology is >50%. If the
homology drops to 20% there will be large structural differences
that are at present impossible to predict.

However, the active sites of distantly related proteins can have
very similar geometries (Lesk and Chothia, 1980; Chothia and
Lesk, 1982; Read et al., 1984). This is because of the coupling
of the structural changes that has occurred during evolution (Lesk
and Chothia, 1980). Thus the structure of the active site in a
protein may provide a good model for those in related proteins even
if the overall sequence homologies are low.



Table III. Cytochrome c family. Root mean square difference in the position
of main chain atoms of residues in the conserved structural core, Δ

                 Core determined for individual          Core common to four
                 homologous pairs                        cytochrome c structures

Protein pair a   Core    Δ (Å)   Residue identity        Core    Δ (Å)   Residue identity
                 size            in core (%)             size            in core (%)

3CYT:1CCR        103     0.62    59                      48      0.38    65
3CYT:3C2C        99      1.13    37                      48      0.91    48
1CCR:3C2C        101     1.47    41                      48      1.01    56
3C2C:351C        56      1.50    36                      48      1.39    35
3CYT:351C        57      1.65    25                      48      1.56    31
1CCR:351C        58      1.86    24                      48      1.66    27

a See Table I for abbreviations.
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Table IV. The homology of buried and surface residues 


Protein pair                                         Residue identity (%)

                                                     Buried        Surface       Overall
                                                     residues a    residues a

S. griseus proteases A and B                         83            52            65
Human and hen egg white lysozyme                     77            52            61
Tuna and rice embryo cytochrome c                    77            50            59
Human haemoglobin α and Chironomus erythrocruorin    21            16            18
IgG Kol domains Vλ and Cγ1                           31            11            17

a Buried residues are those with accessible surface areas <20 Å².
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A major goal of genome sequencing projects is the complete description 
of the function of all proteins. For most proteins sequenced in genome 
projects, an experimentally determined function is not available. Fortunately, 
evolutionary relationships can be exploited to predict the function of many 
other proteins from their amino acid sequence. The techniques for such 
predictions, sequence analysis by computational and database methods, are 
becoming increasingly sophisticated and are now an essential part of genome 

analysis. 
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Introduction 



Today, a sound PhD project in molecular biology often
ends with the biochemical characterization and analysis
of a cloned and subsequently sequenced gene.
In the hope of additional functional insights, as well
as interesting structural and evolutionary relationships,
the sequence of the gene is usually further characterised
by various computer methods. Often, the homologies
found are extremely helpful for functional predictions.
Today, the analysis of an average gene of about
1 kilobase (kb), including open reading frame (ORF)
prediction, database searches, multiple alignment, pattern
definition, and profile searches, might take several
days or more, depending on the depth of the
analysis. Soon genome projects will produce several
hundred kilobases of raw sequences a day. Traditional
sequence-analysis procedures cannot adequately handle
such a high rate of production. The resulting dual
challenges facing genome sequence analysis are both
quantitative and qualitative: developing more efficient
[...] and increasing the scope of functional prediction.
The work reviewed here represents the first steps on
the way to meeting these challenges.



Tip of the iceberg: genomic sequence data today

In spite of the increased production of sequence data,
today's databases contain only a small fraction of
the genome's complete sequence. Many eukaryotic
genome projects, including the Human Genome Initiative,
are currently still assembling high-resolution
physical maps [1,2], an essential prerequisite of systematic
large-scale sequencing. So the real flood of
genomic data is still to come.



Current output of genome sequencing projects 

The most significant stretches of genomic sequences
known to date are sizeable parts of relatively small
genomes, such as the nematode Caenorhabditis elegans,
the yeast Saccharomyces cerevisiae, and the
bacterium Escherichia coli (Tables 1 and 2). In addition,
there exist collections of numerous smaller chromosome
segments from many different genomes, not
yet assembled into long continuous stretches. The
largest continuous sequence determined so far is a
2 megabase (Mb) stretch of chromosome III of C. elegans
[3], with another 5 Mb expected later in 1994
(R Durbin, personal communication). The next-largest
pieces come from the European yeast genome project
[4]: the complete sequences of five out of sixteen chromosomes
will be available by the end of 1994 (Table
1), giving a total of 2.3 Mb, or 15% of the yeast genome.
Bacterial genomes are next, with several large E. coli
segments [5-7,8*,9,10] resulting in continuous pieces of
up to 500 kb (Table 2). The total E. coli sequence data
already covers more than 60% of the genome (P Rice,
personal communication).

In less than 5 years, the genome projects of invertebrates
such as the nematode [11], plants such as cress
[12], vertebrates such as the puffer fish [13°], and mammals
such as human [1], or mouse [2, 14] will probably
generate hundreds of Mb of sequence data. Remarkably,
the plans for genomic sequencing efforts appear
to be on, or ahead of, schedule (Table 2) and are likely
to meet the projected completion dates, provided the
planned increases in funding are forthcoming. Completion
of the E. coli and yeast sequencing efforts is
expected by 1996. The C. elegans genome will be
complete by the end of this decade (Table 2). In the
human genome project, the initial estimates for completing
sequencing were revised downward, from 2020
to 2010 and, most recently, to 2005 [1]. The methodologies
for analyzing these data and, in particular, the

Abbreviations

3D, three-dimensional; EST, expressed sequence tag; kb, kilobase; Mb, megabase; ORF, open reading frame.

Table 1. Status of sequencing projects of model genomes.

Species                                Size (Mb)   Sequenced   Sequenced        Year of
                                                   (Mb)^a      (% of total)^a   completion

Organelles/Viruses
  Mitochondria (various)               0.01-0.19               100%             1981
  Chloroplasts (various)               0.12-0.16               100%             1986
  Vaccinia virus                       0.19                    100%             1991
Prokaryotes
  Mycoplasmas                          0.6-1.4     0.3         30%
  Mycobacterium leprae                 2.8         0.1         4%
  Bacillus subtilis                    4.2         2           40%
  Escherichia coli                     4.7         3           60%
Eukaryotes
  Yeast (Saccharomyces cerevisiae)     15                      30%
    Chromosome III                     0.31                    100%
    Chromosome XI                      0.66                    100%
    Chromosome II                      0.84                    100%
  Cress (Arabidopsis thaliana)         100         1           1%
  Nematode (Caenorhabditis elegans)    90          3           3%
  Fruit fly (Drosophila melanogaster)  170         3           2%
  Mouse (Mus musculus)                 3000        9           0.3%
  Human (Homo sapiens)                 3600        19          0.6%

nucleotide sequence database; redundancies (only partly excluded) would lead to a

The fast track to protein sequences: the EST approach

An attractive alternative to continuous sequencing is
the random sequencing of reverse transcribed messenger
RNA (cDNA) fragments, often called 'expressed
sequence tags' (ESTs) [15]. This approach is 'quick and
dirty' in that it is initially limited to single sequencing
runs (no verification, no extension). However, it
is ideal for very rapid identification of gene products
as a first step toward elucidation of gene function, a
major goal of genome projects. ESTs correspond to
protein fragments of about 100 amino acids and can
be used to obtain an expression profile for a particular
organism or a tissue, to identify exons, or to provide
a glimpse into the molecular repertoire of various
organisms. Already, a sizeable fraction of the ESTs
sequenced have sequence similarity to proteins of known
function. In 1991 only a few hundred ESTs from human
brain were known [15]; these increased to more than
6000 human ESTs by 1993 [16**], and probably up
to 100 000 sequences from several organisms will be
available by the end of 1994. The rate of data production
is so high that ESTs corresponding to most of the
highly expressed human genes will probably approach
saturation in 1995, many of them in private commercial
efforts. Publicly available ESTs from various organisms
are now in a specialized database, called dbEST [17].



Rising to the challenge: large scale sequence
analysis

Coping with the analysis of rising amounts of sequence 
data has become a major scientific enterprise over the 
last few years, driven primarily by an abundance of
information and its frequently incomplete digestion.
Significant recent progress has been made in four main
areas: first, development of databases; second, advances
in computer networking; third, improvements in infor- 
mation access software; and fourth, improvements in 
algorithms and methods of sequence analysis by com- 
puter. We review each of these in turn. 



Grow and link: development of databases 

The usefulness of sequence databases, such as
EMBL/GenBank (nucleic acids) and Swissprot/PIR
(proteins), is currently limited by incomplete integration
into a coherent whole, and by incomplete links
to other biological databases, such as GDB (genome
maps) and PDB (three-dimensional [3D] structures).
A number of efforts are under way to provide more 
inter-database links (technical term: interoperability 
of databases), to add carefully annotated specialized 
databases, and to add value to existing databases by 
deriving additional information from them. 

For example, numerous cross-pointers have been 
added to the Swissprot protein sequence database 
[18]. Given a protein sequence entry in this database, 

Table 2.

                                         kb^a         ORFs^a    References^b

C. elegans chromosome III                2000         n.d.      [3]
Yeast chromosome II                      840          387       Feldmann et al.
Yeast chromosome XI                      666          331       Dujon et al.
Escherichia coli^c                       500          n.d.      [illegible]
Yeast chromosome III                     320          170       [4,49*]
Cytomegalovirus                          229          190
Vaccinia virus                           191          74
Liverwort mitochondrion                  187          74
Tobacco chloroplast                      156          109
[illegible]                              97           92        [54]
Fruit fly homeobox loci                  80           n.d.
Bacteriophage λ                          49           63
Mycobacterium leprae                     37           12        [55]

Contig/EST collections                   kb           pieces

Mycoplasma contigs                       350          650       [56]; Gillevet et al.
Caenorhabditis elegans (nematode) EST    n.d.         4699      [17,57,58]
Human brain EST                          n.d.         6000      [16**]
Arabidopsis thaliana (cress) EST         n.d.         4512      [17]
Oryza sativa (rice) EST                  n.d.         4231      [17]

Databases^d                              kb or kaa^a  seqs^a

Swissprot^e (kaa)                        11 500       33 300
EMBL^e without ESTs (kb)                 155 000      155 000
dbEST (kb)                               ~10 000      31 800
PDB (kaa)                                444          2200

^a Abbreviations: kaa, 1000 amino acids; kb, kilobases; n.d., not determined, or not available to the authors; ORFs, number of open
reading frames (predicted proteins); seqs, number of sequence entries. ^b Selected references only are given; unnumbered references are
personal communications. ^c There are four neighbouring, slightly overlapping E. coli segments in the EMBL nucleotide sequence
database; the longest one covers about 176 kb. ^d Redundancy within and among sequence databases slightly reduces the numbers
of bases/genes given; PDB, Protein Data Bank of three-dimensional structures. ^e Similar numbers for PIR (proteins) and GenBank (DNA).

cross-references give information, if available and
applicable, on the nucleic acid sequence of its gene, the
genomic map location of the gene, the enzyme commission
functional classification, the three-dimensional
structure, known homologs, known characteristic
sequence patterns (PROSITE [19]), and so on. An
example of added-value databases is the database of
sequence-structure alignments (HSSP) that provides
multiple sequence alignments, position-specific
information on the degree of conservation, and sequence
search profiles [20]. Examples of specialized databases,
carefully maintained by curators, are the databases
of ESTs [17], E. coli databases [21,22], the worm
(nematode C. elegans) database (see above), and FLYBASE
(Drosophila) [23]. There are many more, too numerous
to list. The urgent need for cross-references and
interoperability is illustrated by the fact that a directory
of molecular biology databases is a database in itself,
e.g. LiMB [24].



Dial-up information: wide-area computer networks 

Databases in themselves are of limited use unless 
they are easily accessible. Computer networks and 
information-retrieval software provide such access.
Recently, both the physical and logical infrastructure
of national and international computer networks and 
network software have improved considerably. As a 
result, use of wide-area information resources has be- 
come an important activity for sequence analysis ex- 
perts. These resources allow the search, identification, 
and subsequent transfer of large amounts of publicly 
available material, including software and data in vari- 
ous forms. Information typically travels on Internet, the 
precursor of the planned 'information superhighways'. 
National and international laboratories provide various 
servers for automated sequence analysis [25]. Database 
updates can be performed using file transfer protocols 
(ftp). Resource browsers such as XMosaic™, devel- 
oped at the US National Center for Supercomputer 
Applications (NCSA) in Illinois, are valuable software 
tools for navigating networks in search of data and
information.

As more and more institutions offer data and ser- 
vices, the identification of relevant information re- 
sources becomes increasingly difficult. What is needed 
are client-server systems and information service bro- 
kers. In such systems, the user specifies the tasks us- 
ing local (client) software and sends out the request 
to a remote system (server) which then distributes 
the required actions over the appropriate and available
resources (for example, the 'Hassle' prototype [26]
being tested on EMBnet, the European Molecular Biol- 
ogy Network). The key point is that copies of all data, 
software and information are not, and cannot, be kept 
locally without considerable effort, and therefore con- 
venient access to network information services is an 
essential element for efficient sequence analysis work. 



Access to integrated information: software tools 

An interesting development in the direction of intel- 
ligent retrieval systems for genome data is AceDB 
(A C. elegans database, by Richard Durbin and Jean 
Thierry-Mieg), an integrated system developed for the 
nematode sequencing project, now also in use by 
other genome projects. AceDB incorporates many pro- 
grams that handle and analyze raw DNA and other 
sequence data, as well as map data, and contains a 
convenient graphical user interface with hyperlinks 
for data browsing. Other systems are under develop- 
ment, each with slightly different orientation, at the US 
National Center for Biotechnology Information (NCBI), 
the Genome Database Center at Johns Hopkins Medical
School (GDB), EMBL, the German Cancer Research 
Center (DKFZ) and elsewhere. An excellent example 
of future client-server access to information across in- 
ternational boundaries is the network version of the 
Entrez software distributed by the NCBI that gives 
access to Medline literature citations relevant to se- 
quence databases. If integrated access to diverse infor- 
mation becomes generally available, the efficiency of 
sequence analysis work will be much higher. To make 
effective use of all the information, genome projects 
will have to develop and apply techniques for informa- 
tion services in parallel with sequencing technology. 



Progress report: methods of sequence analysis 

Common sequence analysis practiced by the casual 
user, using standard programs, tends to miss a sig- 
nificant fraction of the functional information in protein 
sequences. It is therefore important to see what can be 
achieved with new and sophisticated methods. The list 
of basic procedures (selection in Table 3) is gradually 
increasing as new algorithms are invented and old ones 
improved (reviewed in [27]). We review here some of 
the most interesting programs that have had a practical 
impact on the field. 



Identification of open reading frames
The accuracy of ORF prediction has been improved
[28,29], relative to widely used programs such as Genmark
[30], Genefinder [31], Grail [32] and the Staden
package [33]. Frameshift detection and detection of
other errors are an important technical issue, affecting
the quality of derived amino acid sequences [34,35].
Routine use will indicate which of these new methods 
have noticeable practical advantages. 
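
The basic idea behind ORF identification can be illustrated with a deliberately naive sketch: scan each forward reading frame for a start codon that is followed, in frame, by a stop codon. This is not how Genmark, Genefinder, Grail or the newer methods [28,29] work, since those rely on statistical models of coding sequence. The function name `find_orfs`, the `min_codons` cut-off and the ATG-only start rule are assumptions made purely for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the given strand.

    start/end are 0-based, end-exclusive positions; frame is 0, 1 or 2.
    Hypothetical helper: only ATG is treated as a start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))  # [(2, 14, 2)]
```

The reverse strand can be scanned by passing the reverse complement of the sequence to the same function.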



Analysis of amino acid composition bias 
Once a putative protein sequence is available, a num- 
ber of analysis methods can be applied. The best 
known, and most powerful, are sequence database 
alignment searches. However, an assessment of the 
significance of particular 'hits' (match of query with 
target sequence) depends strongly on the structural 
and functional class of the protein coded by that se- 
quence (globular, filamentous, transmembrane, etc.). 
In practice, the problem is twofold. First, standard 
measures of sequence similarity require different cut- 
offs depending on the amino acid composition of the 
sequences being compared. Second, larger proteins 
have an inhomogeneous amino acid composition, i.e. 
a distinctly different composition in different regions 
(e.g. hydrophobic or charged stretches or 'domains').
Thus, before doing alignment database searches, it is 
useful to first determine the composition bias, i.e. devi- 
ation from the typical composition of globular proteins 
or deviation from the average composition in the pro- 
tein sequence database. An approximate classification 
of different types of composition bias is as follows: 

Coiled-coil arrangements. Whereas the detection of 
such regions using the program of Lupas et al. [36]
appears to be fairly accurate, no reliable distinction 
between two-stranded and three-stranded, or between 
parallel and antiparallel, coiled-coil regions is possible 
yet. 

Transmembrane regions. There are many programs 
for the prediction of transmembrane segments and sig- 
nal peptides. Most are based on the simple notion of 
detecting runs of hydrophobic residues. We are not 
aware of any recent substantial improvement in accu- 
racy. 

Other low-entropy (or low-complexity) regions. 

A particular amino acid composition may be the re- 
sult of functional selective pressure, e.g. a run of 
positively charged residues involved in non-specific 
protein-nucleic acid interaction. Recently, progress has 
been made in identifying heavily biased regions such 
as small repeats, long charged clusters, or regions 
rich in one, or a few, particular amino acids gener- 
ally atypical of globular proteins (Wootton, this issue, 
pp 413-421). Complementing the work of Karlin and 
associates [37], the programs 'Seg' [38*] and 'Xnu' [39*]
fill an urgent need in this area. These methods iden- 
tify and mask composition-biased regions in the query 
sequence. The surrounding sequence and the biased 
regions can then be processed separately. 
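
As a rough illustration of what masking a composition-biased region means, the sketch below computes the Shannon entropy of the residue composition in a sliding window and masks every window that falls below a cut-off. This is a simplified stand-in and not the algorithm used by 'Seg' or 'Xnu'. The window length, the threshold and the function names are assumptions chosen only for the example.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(protein, window=12, threshold=2.2, mask_char="x"):
    """Replace residues in every low-entropy window with mask_char.

    window and threshold are illustrative values, not published settings.
    """
    protein = protein.upper()
    masked = list(protein)
    for i in range(len(protein) - window + 1):
        if window_entropy(protein[i:i + window]) < threshold:
            for j in range(i, i + window):
                masked[j] = mask_char
    return "".join(masked)


query = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(query))
```

The masked query can then be used in a database search, while the masked stretch itself is examined separately.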

Table 3.

Step                                Problem                                                        References^a

Contig assembly                     Detection of DNA overlaps
Error correction                    Identification of frameshifts, etc.                            [34,35]
Open reading frame prediction       Identification of putative genes                               [28,29]
Masking                             Exclusion of regions with amino acid composition bias          [38*,39*]
Coiled-coil detection               Recognition of coiled-coil areas                               [36]
Hydrophobicity analysis             Detection of transmembrane and signal segments
Database homology search            Detection of sequence similarities                             [41*,43**]
Multiple alignment and tree         Definition and analysis of protein families; definition        [41*,44,59]
  construction                        of profiles
Database profile or pattern search  Detection of distant relationships
Self-alignment                      Detection of internal repeats
Secondary structure prediction      Prediction of helices, strands, loops, and surface/interior    [45]
Three-dimensional modelling         Construction of detailed atomic models based on homology

^a Only references to some recent developments are included.



Derivation of amino acid comparison matrices
Alignment search methods use scoring matrices that
assign similarity values for any pair of the 20 amino
acids. A new scoring matrix, 'Blosum', was derived
from multiple-sequence alignments in a database of 
conserved regions ('Blocks' [40]). Tests indicate that use 
of the Blosum matrix leads to improved performance 
in homology detection [41*] and its use is therefore in- 
creasing. 
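
How a scoring matrix can be derived from aligned columns may be sketched as a log-odds calculation: count the residue pairs observed in each alignment column, compare the observed pair frequency with the frequency expected by chance, and take a scaled logarithm. The sketch below is an illustration under stated assumptions only: the toy columns are invented, the half-bit scaling is one common convention, and the published Blosum procedure includes further steps (such as the treatment of closely related sequences) that are omitted here.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(columns):
    """Derive rounded log-odds scores (half-bit units) from alignment columns.

    columns: iterable of strings, one string of aligned residues per column.
    Returns a dict keyed by alphabetically sorted residue pairs.
    """
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background probability of each residue occurring in a pair
    background = Counter()
    for (a, b), freq in observed.items():
        background[a] += freq / 2
        background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # unordered pair: both orderings contribute
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented toy columns: mostly conserved A or L, one mixed column
toy_columns = ["AAAA", "AAAA", "LLLL", "LLLL", "ALAL"]
print(log_odds_matrix(toy_columns))
```

Identical pairs that occur more often than chance receive positive scores, while rarely observed substitutions receive negative ones.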



Database searches 

A search for sequence similarities in a protein sequence 
database is the most useful and most widely used 
method of functional prediction. The limitations arise
from the fact that derived amino acid sequences enter 
the databases only after some delay (due to processing) 
or not at all (due to errors in nucleic acid sequences, 
or in their interpretation). It is therefore useful, albeit 
costly, to search six-frame translations of nucleic acid 
sequences. One tool in the 'Blast' series [42] of fast
search programs, TBlastN, does this very effectively. 
On occasion, this mode of search reveals homologies 
with very recently sequenced genes or with adjacent 
sequences in two different reading frames, usually evi- 
dence of a frameshift, possibly as a result of a sequenc- 
ing error. The reverse operation, searching the protein 
sequence database with the six-frame translations of a 
newly sequenced gene using BlastX [43**], is extremely 
useful for characterizing protein coding regions in raw 
nucleic acid sequence data, thus complementing ORF 
prediction programs. 
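
The six-frame translation that underlies this kind of search can be sketched in a few lines. The code below is a minimal illustration using the standard genetic code and is not taken from the Blast programs; the function names and the frame labels ('+1' to '-3') are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels ('+1'..'+3', '-1'..'-3') to peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA")["+1"])  # MA*
```

Each of the six peptides can then be compared with a protein database, which is why a frameshift shows up as matches to adjacent regions in two different frames.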

If several putative homologs are detected in a database 
scan, multiple-sequence alignment can reveal common 
functional motifs. A new method for the automatic de- 
tection of motifs in a set of sequences [44] now offers 
an alternative to standard programs. The use of profiles
derived from multiple-sequence alignments can also
lead to improved secondary structure prediction [45].

Despite the progress in basic analysis techniques, the
interpretation of apparently significant sequence
similarities in functional terms is still an underdeveloped
area. More sophisticated methods for the analysis of a 
set of sequences in a family are needed for a more ac- 
curate homology-based prediction of protein function. 



The key goal: prediction of protein function 

When all the analytical procedures (Table 3) have been 
applied to a newly determined protein sequence of 
unknown function, one faces the difficult task of as- 
signing a putative function based on evidence from 
sequence similarities, pattern detection, and so on. 
This part of the analysis process cannot easily be 
embodied in an algorithm and therefore is the least 
automated. Typically, an attempt is made to interpret 
sequence similarity by transferring the functional infor- 
mation about one protein to the homologous relative. 
The problem is that prediction of function by analogy 
can be very precise and complete, or it can be very 
fuzzy and incomplete, depending on the data at hand 
and on the precise nature of the similarity. 

The simplest cases are those in which there is strong se- 
quence similarity to, for example, an enzyme of known 
function. The new protein is then predicted to have the 
same function as its homolog. But even in simple cases 
caution is needed. Sequence variation can have diverse 
consequences. Even highly similar proteins might have 
completely different functions. Striking examples are 
the eye lens crystallins, structural proteins which
apparently evolved recently from metabolic enzymes [46].
Furthermore, isoenzymes fine-tune the metabolism by 
varying only slightly in substrate affinity; the sub- 
strate specificity of clearly homologous proteins can 
be changed completely, as observed in the differ- 
ent families of sugar kinases [47]. An unambiguously 
homologous protein may be the equivalent gene in 
another species, indicating direct lineage (ortholog) 
or the homology may merely imply descent after gene 
duplication (paralog). In the latter case, one has to be 
particularly cautious with functional predictions. 



In other cases, structural or functional similarity may
occur for only a small part of a larger given protein (a 
domain), e.g. from the presence of a zinc finger motif. 
Although this may suggest that the protein binds DNA 
or RNA, little else can be concluded about the function 
of the protein of which this domain is a part. Finally, 
only a few proteins, mainly enzymes, are sufficiently 
well characterized so that molecular details of catalysis 
as well as higher levels of regulation and physiological 
roles can be described. 

These problem cases illustrate the need for more pre- 
cise knowledge about functional variation in evolution. 
Which functional changes occur as a result of cer- 
tain types and certain amounts of sequence change? 
More biochemical and genetic characterizations of sets 
of homologous proteins (or of engineered mutants of 
natural proteins) are needed before more precise and 
quantitative rules for function prediction by homology 
can be formulated. 



Yield: how much functional information from homology?

Considering the limitations described above, homo- 
logy- and analogy-based predictions have proved to be 
extremely powerful in exploring genomic information. 
When comparing the output of several large-scale se- 
quencing projects (Table 2), an identification of partial 
function has been possible for 40-65% of all predicted 
proteins (Fig. 1), with this figure showing an increasing
tendency [16**,48,49*]. Remarkably, the number of
tentative functional identifications by homology ex- 
ceeds by far the number of functional determinations 
by direct experiment (Table 4). The first complete chro- 
mosome sequences have led to the rather sudden real- 
ization that we already know a considerable fraction of 
all protein functions. In yeast chromosome III, the most
carefully analyzed eukaryotic chromosome to date, the
rate of tentative functional assignment by homology
exceeds 50% (Table 4). 

Extrapolating into the future, we can expect a rapid rise 
in the probability of homolog detection when comparing
a new sequence with all known sequences (Green,
this issue, pp 404-412). This is particularly true if the 
sequencing of human ESTs approaches saturation in 
1995, as has been predicted (M Adams, personal com- 
munication). However, the rate of functional prediction 
by homology will not rise nearly as fast. This is sim- 
ply because on the one hand new sequences enter the 
databases at a rapid rate, while on the other hand most 
of these new sequences come without primary, i.e. ex- 
perimental, functional information. So the gap between 
the total number of known protein families and those 
with identified function (Table 4) will increase consid- 
erably in the near future, before it ultimately decreases 
near the saturation limit. 
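
The arithmetic behind the function-homology gap can be written out explicitly. The sketch below uses the 1994 percentages for yeast chromosome III quoted in the legend of Fig. 1 (61% with a homolog, 54% with known function, 3% with known function but no homolog); the function and argument names are illustrative only.

```python
def function_homology_gap(homology, known_function, known_function_no_homolog):
    """Percentage of proteins that belong to a sequence family but lack a known function."""
    # Proteins of known function that also have at least one homolog
    known_with_homolog = known_function - known_function_no_homolog
    return homology - known_with_homolog


gap_1994 = function_homology_gap(
    homology=61, known_function=54, known_function_no_homolog=3
)
print(gap_1994)  # 10
```

The gap widens whenever the fraction of proteins with a homolog grows faster than the fraction with a known function, which is the situation the text anticipates for the near future.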

These observations have two implications for the al- 
location of resources in genome projects. To maximize 
functional information, it is probably advisable to com- 
plement sequencing effort with two key activities. One 
focus is a concerted program for the experimental 
determination of new protein functions, in order to 
increase the store of primary functional information 
on which all derived functional identifications depend. 
The other focus is a further improvement in the reliability
of homology detection by sequence data analysis,
in order to make maximum use of the experimental in- 
formation. Both activities are essential and interdepen- 
dent. On the one hand, even a single experimentally 
determined function can be immediately carried over, 
at least in part, to all of its sequence relatives and rep- 
resents a sizeable net gain in information, if it is of a 
new type. On the other hand, even a single percentage 



171 ORFs = 100% 




known function: 54% 



Fig. 1. Information content in the proteins
of yeast chromosome III [49*]. There is
a 10% gap between all protein families
('homology') and those of known function,
termed here the function-homology
gap (see text and Table 4). Numerically:
10% = 61% (homology) - 54%
(known function) + 3% (known function,
no homology).

Table 4. Identification of protein function by experiment and by homology, showing an increase in the function-homology gap using
Saccharomyces cerevisiae chromosome III as an example.

                                Yeast chromosome III   Yeast chromosome III   Anticipated increase   Speculative estimate
                                (as of 1/1993)         (as of 1/1994)         or decrease^a          for yeast (1995/6)

Function^b
  Function by experiment        11%                    14%                                           17%
  Function only by homology     31%                    40%                    ++                     50%
  No function yet               58%                    46%                                           33%
Homology^c
  Member of sequence family     38%                    61%                    +++                    80%
  No family yet                 62%                    39%                                           20%
Function-homology gap^d         1%                     10%                                           15%

^a Increase: slow (+), intermediate (++) and rapid (+++); same for decrease (-), (--). ^b Known full or partial function is given, identified
by direct experiment or by detection of homology to a protein of known function (data from [4,49*,50]). Experimental function
represents an estimate, as there is some ambiguity in the definition of function; e.g. 'temperature sensitive lethal' was not counted as
known function, while 'DNA repair protein' was counted. ^c Homology represents a significant sequence similarity to at least one other
protein in the sequence databases, thus defining a protein family of two or more sequences, with or without known function.
^d 'Gap' represents the fraction of all known protein sequences which have at least one homolog (belong to a family), but for which the
function is not yet known, and was estimated using the numbers in the table and the number of proteins with known function that have
no known homolog; the reason for the increasing gap is the determination of numerous new protein sequences without known function.



point improvement in functional prediction (currently 
at about 50%) translates to a large absolute number of 
identified protein functions, with immediate savings in 
experimental effort. 



Molecular detail: implied three-dimensional structures 

Detection of significant sequence similarity to a protein
of known 3D structure immediately implies prediction 
of the 3D structure of the new protein by homology. 
The prediction leads to detailed arguments about the 
mechanism of protein action and about the role of 
particular residues. Often, very conserved residues dis- 
tributed along the protein chain are in spatial proximity 
in 3D, explaining previously puzzling conservation patterns
and suggesting detailed experiments. It is therefore
interesting that as many as 19% of all yeast chromosome
III sequences are significantly homologous to
a known 3D structure (Fig. 1). For these proteins, ap- 
proximate atomic models can be built, inspected using 
3D graphics, and used as a basis for planning exper- 
iments. However, coverage is incomplete for certain 
types of protein structures, e.g. membrane proteins of 
which only a handful of 3D structures are known. 

As the rate of 3D structure determination of proteins 
now exceeds one structure a day, the rate of 3D pre- 
diction by homology will probably reach the 20-25% 
level in the not so distant future. It already exceeds 24% 
for the Swissprot database (R Schneider, personal
communication). This number comes from a systematic
all-against-all alignment of the sequences in the Protein
Data Bank (known structures) with the 32 000
sequences in Swissprot [20], assessing homology by
the application of a threshold for structural similarity.
It is remarkable that we already have the opportunity
to identify the approximate tertiary structure of one in
four newly sequenced proteins.



Feedback: checking predictions by experiment 

The experimental verification of predictions by homol- 
ogy is very slow. The higher the belief in the validity 
of computer-based sequence analysis, the lower the 
incentive to perform control experiments. However, 
there is a small but non-negligible number of cases 
where prediction entices experiment, for a variety of 
reasons. One case is the surprising occurrence of a 
certain functional type in another organism. 

An example of this is the discovery of a certain type 
of polymerase in yeast. The homology and predicted 
functional analogy between yeast ORF YCR14C and 
mammalian DNA polymerase-β [48,50] was experimentally
verified by in vitro tests on the expressed protein
[51], now named yeast DNA polymerase IV (pol IV
gene), an example of a successful prediction based
on multiple alignment and sequence pattern analysis
[50,52]. This now opens the way to biochemical and
genetic studies in yeast, not just mammals, to understand
the presumed role of the short gap-filling activity
of β-polymerases in DNA repair.



Improvements: what to expect 

The main gain in the level of function prediction over 
the last 2 years came primarily from three sources. First, 
protein-sequence databases have improved in quantity 
(more sequences, more frequent updates) as well as 
in quality (better annotation and more cross-point- 
ers). Second, multiple-sequence alignment methods 
have improved, especially profile and pattern defini- 
tion for protein families and their detection in new 
sequences. Third, diverse information resources are 
more conveniently accessible over the Internet, low- 
ering the threshold for careful analysis, and controls 
and cross-checks for borderline cases. Full use of these 
improvements typically requires expertise. 

Fig. 2. Sialidase sequence motif in yeast
proteins [49*]. A typical example of short
and weak motifs that are not detectable
by standard homology searches. The motifs
are, however, detectable by profile
and pattern searches. The significance of
the sialidase motif is supported by its multiple
occurrence with an average spacing
of about 50 amino acids. As a result,
a sialidase activity is predicted for the
yeast proteins. As a 3D structure of one
bacterial sialidase is now known (Fig. 3),
structural, as well as functional, information
can be inferred for the yeast proteins.



An example of a result obtained with more sophisti- 
cated methods in the context of large-scale genome 
analysis is the identification of sialidases in yeast chro- 
mosomes III and II. During the analysis of chromosome 
III [49*,50] the putative ORF YCR100 initially did not 
match any database protein apart from PEP1, a func- 
tionally uncharacterized protein from yeast chromo- 
some II. However, the conservation profile between 
both proteins revealed four short, but conserved, in- 
ternal repeats. Using this information in pattern search 
methods [52], the investigators detected subtle similar- 
ities to conserved repeats in bacterial and protozoan 
sialidases (Fig. 2). Very recently, the three-dimensional
structure of one of these sialidases was determined [53]
(Fig. 3) and indeed, it contains internally repeated
β-sheets forming a superbarrel or propeller fold, fully
consistent with the sequence repeat; the most con- 
served residues (Fig. 2) are located in equivalent po- 
sitions in the respective sheets (Fig. 3). Thus, based on 
a short signature motif (too short and weak to be de- 
tected by conventional homology search programs) a 
rather precise functional and structural prediction was 
made. This example emphasizes the need to incorpo- 
rate such methods into standard analysis procedures 
and illustrates the potential gain. 
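
A pattern search for a short repeated motif can be illustrated with a regular expression. The sketch below is a simplified example under stated assumptions: the pattern S-x-D-x-G-x-T-W is the commonly quoted sialidase 'Asp box' consensus and is not spelled out in this review, the test sequence is synthetic, and the profile and pattern methods cited in the text [52] tolerate mismatches in a way that a strict regular expression does not.

```python
import re

# Assumed consensus for illustration: Ser-x-Asp-x-Gly-x-Thr-Trp
ASP_BOX = re.compile(r"S.D.G.TW")


def find_repeated_motif(protein, pattern=ASP_BOX, min_copies=2):
    """Locate all motif copies and report the spacing between consecutive copies."""
    starts = [m.start() for m in pattern.finditer(protein.upper())]
    spacings = [b - a for a, b in zip(starts, starts[1:])]
    return {
        "starts": starts,
        "spacings": spacings,
        "repeated": len(starts) >= min_copies,
    }


# Synthetic sequence with two motif copies placed 50 residues apart
synthetic = "A" * 10 + "SKDGGKTW" + "A" * 42 + "SRDNGATW" + "A" * 5
print(find_repeated_motif(synthetic))
# {'starts': [10, 60], 'spacings': [50], 'repeated': True}
```

Multiple copies at a regular spacing are what lend significance to a motif that would be too short to stand out on its own.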

In summary, for the immediate future the most press- 
ing and promising needs of genome sequence analysis 
are manifold. First, further refinement of pattern and 
profile searches. Second, automation of the analysis 
process, especially for sequence families. Third, improved
data support by direct access to specialized
sequence and bibliographic databases. Fourth, ear- 
lier public accessibility of data from major sequenc- 
ing projects. Fifth, training of analyzers in advanced 




Fig. 3. Ribbon plot of the sialidase from Salmonella typhimurium
[53]. It is an example of the so-called β-propeller fold in which six
four-stranded β-sheets form a superbarrel. The four sequence motifs
are located in equivalent positions in four of the six β-sheets. In the
remaining two sheets, the corresponding motifs are not detectable
(Fig. 2).

methods. Sixth, the development of a more refined 
classification of protein function, and finally, a better 



understanding of the effect of sequence changes on
protein function.

Conclusion

In a few years' time, the complete sequences of several
entire genomes will become known, resulting
in a series of historical achievements: E. coli,
mycoplasma(s), yeast, nematode, are proceeding apace
(Table 1). Genomes from organisms of all taxonomic
ranks will follow, including 'Homo ignorans'. The
analysis of these data will require a high degree of 
automation in sequence analysis without sacrificing 
the sensitivity of present methods for the detection 
of distant sequence similarities. Where sequence anal- 
ysis has no answers, experimental technologies will be 
essential for the genetic and biochemical characteriza- 
tion and the physical identification of completely new 
types of proteins. In addition, many experiments will 
be stimulated by the detailed predictions of sequence 
analysis. The database — or is it knowledge base? — 
of all proteins will gradually be completed, including 
3D structures and diverse functional knowledge. 



Imagine that at some time almost all proteins of a
particular organism will be known (sequence, 3D
structure and function) and stored in the World-
Prot database. The obvious question that then arises
is whether the molecular repertoire of an organism is 
sufficient to characterize its physiological and evolu- 
tionary behavior. The obvious answer is that it is not, 
and that biological experiments and theories at higher, 
less microscopic, levels are needed to complement the 
atomic information available from genome-sequencing 
projects. 



Will a graduate student today, trained in sequencing 
and in sequence analysis, end his career like a zoolo- 
gist in the beginning of this century, hunting the last 
unclassified butterflies in Madagascar? In other words, 
will experimental and computational sequence analysis 
be transformed from a skilful scientific endeavor to an 
activity of lesser scientific interest? Or, will it provide 
the ultimate answers to the grand questions of biologi- 
cal science, about structure and function, development 
and evolution? The truth lies somewhere in between. 
The functional classification of all proteins will be an 
excellent intermediate goal for that graduate student, 
but also an excellent point of departure for addressing 
the real questions of human health, the environment,
and the future evolution of life.

Predicting function from sequence using computational tools is a highly
complicated procedure that is generally done for each gene individually. 
This review focuses on the added value that is provided by completely 
sequenced genomes in function prediction. Various levels of sequence 
annotation and function prediction are discussed, ranging from genomic 
sequence to that of complex cellular processes. Protein function is cur- 
rently best described in the context of molecular interactions. In the near 
future it will be possible to predict protein function in the context of 
higher order processes such as the regulation of gene expression, meta- 
bolic pathways and signalling cascades. The analysis of such higher 
levels of function description uses, besides the information from comple- 
tely sequenced genomes, also the additional information from proteomics 
and expression data. The final goal will be to elucidate the mapping 
between genotype and phenotype. 

© 1998 Academic Press 

Keywords: genomes; computational tools; function prediction; 
comparative genome analysis; proteomics 



Genomes and function prediction 

Prediction of protein function using compu- 
tational tools becomes more and more important 
as the gap between the increasing amount of 
sequences and the experimental characterization of 
the respective proteins widens (Bork & Koonin, 
1998; Smith, 1998). With the availability of com- 
plete genomes we face a new quality in the predic- 
tion process (Table 1) as context information can be 
utilized when analysing particular sequences. This 
review focuses on the added value of genomic 
information on the many steps of function predic- 
tion from genomic sequence. The first reports on 
completely sequenced genomes give an excellent 
overview of the evolving state of the art in the ana- 
lyses of particular genomes (Fleischmann et al.,
1995; Fraser et al., 1995, 1998; Himmelreich et al.,
1996; Goffeau et al., 1996; Kaneko et al., 1996;
Blattner et al., 1997; Tomb et al., 1997; Kunst et al.,
1997; Bult et al., 1996; Smith et al., 1997; Klenk et al.,
1997). In addition, there are numerous reviews that
touch on the extraction of functional features from
sequence (e.g. Bork et al., 1994; Andrade et al.,
1997; Koonin & Galparin, 1997; Bork & Koonin,
1998), but very few reviews have been published
that systematically summarize the additional infor-
mation for function prediction that is provided by
the presence of entirely sequenced genomes (orig-
inal papers e.g. by Mushegian & Koonin, 1996a,b;
Himmelreich et al., 1997; Koonin et al., 1997;
Tatusov et al., 1996, 1997; Huynen & Bork, 1998;
Huynen et al., 1997, 1998a; Dandekar et al., 1998b).

What is function? 

"Function" is a very loosely defined term that 
only makes sense in context. Most current efforts 
aim at predicting protein function, but there are 
other types of function, e.g. RNA function or orga- 
nelle function, that also need to be explored. Even 
to describe "protein function" requires a broad 
range of attributes and features (Figure 1). Molecu- 
lar features such as enzymatic activity, interaction

Table 1. Added features from complete genome analysis for function prediction

Genome-specific patterns in the DNA and their usage in genome annotation

Feature: Genome-specific (poly)nucleotide frequencies, codon usage
Usage -> Identification of genes
      -> Identification of recent horizontal gene transfers into the genome

Feature: Genome-specific signal sequences like regulatory regions, promotors
Usage -> Gene identification, identification of the mode of regulation of genes, regulatory regions in mRNA,
         specification of the boundaries of genes
      -> Operon identification

Usage of the complete set of genes in a genome and comparative genome analysis

Feature: The finding of orthologs by comparative genome analysis
Usage -> Narrowing down the function of a gene
      -> Identification of (conserved) regulatory signals neighbouring the orthologues

Feature: Conserved genome organization
Usage -> Genes in conserved clusters have related functions, show physical interaction

Feature: Differential genome analysis
Usage -> Identification of the functions that are absent from a genome
      -> If an orthologous gene is absent, but the function is present, missing genes point either to a wrong
         annotation or a non-orthologous gene transfer
      -> Identification of the functions that are specific to a genome, and might be responsible for the species'
         specific phenotype, delineation of the mapping between genotype and phenotype
      -> Correlation in the patterns of occurrence of genes in the comparison of multiple genomes points to
         functional relations between the genes

Feature: Complete list of detected gene sequences
Usage -> Identifying the optimal candidate gene in the whole genome for an observed enzymatic activity

Various types of patterns and (context) information that become available with the analysis of the complete genome can be used
for function prediction at "lower levels", e.g. in the prediction of the function of single genes.



partners, and pathway context are currently being 
predicted, but only qualitatively. Expression pat- 
terns, regulation, kinetic properties, localization 
and concentration effects and, even more so, dys- 
functions, environmental influence, fitness contri- 
bution or clinical symptoms can currently hardly 
be predicted. There is furthermore a relatively poor 
knowledge of the mechanisms of posttranslational 
modifications (Esko & Zhang, 1996). For example, 
although some sequence patterns for preferred gly-
cosylation sites are known, the prediction accuracy
is still limited and the assignment does not include 
the kind of sugar or carbohydrate that is attached, 
so that most of the functional features of the 
respective proteins will remain hidden. 

The main goal will be to bridge the gap between 
genotype and phenotype (Figure 2), i.e. to under- 
stand the genotype to a degree that the phenotypic 
features can be predicted: What are the genes 
responsible for a certain disease phenotype and 
which proteins of the respective pathway (or an 
alternative one) are the best targets for a drug to 
be developed, or which variations at the DNA 
level are best suited for the respective diagnostics? 
Which genes have to be changed to achieve a 
desired phenotype? To answer such questions in a 
more general way, one needs a detailed under- 
standing of the function of higher order processes, 
including the complex interaction between the 
heritable part of the phenotype and the environ- 
ment. This will require a whole battery of novel 
types of experimental data with appropriate bioin- 
formatics support. 

Nevertheless, it is important to extract as much 
information as possible from sequence data using 



the already available (and inexpensive) compu- 
tational tools to guide experimental work. 

Functional prediction for gene products by 
annotation transfer from homologous sequences 

When homologies of a query are identified in a 
database search (Bork & Gibson, 1996), the anno- 
tated information of the homologue and the taxo- 
nomic, biochemical and/or molecular-biological
context of the query protein are used to extrapolate 
possible structural and functional features of the 
query protein. This approach has proven extremely 
successful although, from a formal point of view 
the hypotheses generated must be experimentally 
verified (Eisenhaber et al., 1995). The information
transfer from well-studied proteins to uncharacter- 
ized gene products has to be done carefully since 
(i) a similar sequence does not always imply simi- 
lar protein structure (Sander & Schneider, 1991) or 
function (in particular in important details such as 
recognition loops) and (ii) the annotation of the 
database protein might be incomplete or even 
wrong. 

Often (particularly in the case of automatic pre- 
diction programs), the function is transferred from 
another member in a multigene family, but not 
exactly from the functional counterpart in a differ- 
ent species. Even orthologues (see below) can differ 
functionally in various organisms. It should also be 
emphasized that generally only the molecular func- 
tions of a protein can be transferred by analogy 
(Figure 1); it is rather rare that a particular 
sequence motif strongly correlates with cellular 
functions as in the case of the DEATH-domain, 
which is mainly contained in apoptosis signalling

Figure 1. Characterization of pro-
tein function. Whereas nucleic acids
fulfil the tasks of storage, transfer
and processing of genetic infor-
mation contained in the genome of
living organisms, the proteins (gene
products) form a complex (single- 
or multi-) cellular machinery for 
the realization of this genetic pro- 
gram (resulting in the phenotype) 
in dependency and in response to 
changing environment conditions. 
Therefore, protein function requires 
also a multilevel, hierarchical 
description comparable with the 
notions of primary, secondary, ter- 
tiary, and quaternary protein struc- 
ture. Here, we propose a possible 
framework for functional character- 
ization and, for each hierarchical 
level, both functional features and 
attributes are described. It should 
be noted that, for most proteins, a
quantitative functional characteriz-
ation is still a matter of the future
and, today, a qualitative descrip-
tion of function for at least some
hierarchical levels can be con- 
sidered an achievement for many 
proteins. (1) Each protein has mol- 
ecular (elementary) functions; e.g. it 
can have specific binding sites for 
substrates, low-molecular effectors, 
nucleic acids or other proteins. 
Given the set of allowed allosteric 
conformational changes, of possible 
interactions with other molecules, 
and of kinetic properties, etc., the 
protein can, for example, catalyse a 
metabolic reaction, it may transmit 
a signal to other proteins or DNA 
or be able to fit into cytoskeletal 
macromolecular associates. Struc- 
tural properties of a protein are 
attributes for the execution of 
function, therefore, 3D structural 
information greatly facilitates 

understanding of function. Also, possible posttranslational modifications such as glycosylation, propeptide cleavage, 
or protein splicing are an important precondition for a protein to fulfil its molecular function. (2) A set of many co-
operating proteins is responsible for a physiological (cellular) function (metabolic pathway, signal transduction cas- 
cade, structural associate etc.). The cellular function of a protein is always context-dependent and is characterized by 
taxon, organ, tissue, etc. Subcellular localization is an essential attribute for this level. For proper functioning, the pro- 
tein has to be translocated to the correct intra- or extracellular compartments in a soluble form or to be attached to a 
membrane. All types of regulation of protein activity are another attribute. For example, the amount of protein mol- 
ecules is often controlled via gene expression which might be limited to certain types of cells or tissues or to specific 
periods in the cell cycle or the individual ontogenesis (expression pattern). (3) Finally, the totality of the physiological
subsystems and their interplay with various environmental stimuli determines phenotype properties (phenotypic 
function), the morphology and physiology of the organism and its behaviour. Some phenotype properties may be 
traced to the activity of a single gene but most are determined by the co-operative action of many gene products. The 
absence of activity of a specific gene can result in phenotypic dysfunction. The knowledge of whole genomes will 
open a new era in the investigation of properties determined by many genes since the total set of genes influencing 
the phenotype is known. 





proteins. Sometimes only the expression pattern
and the tissue context determine the final function-
ality (for example, high sequence identity and even
gene sharing between metabolic medium-chain
dehydrogenases and eye lens crystallins;
Piatigorsky & Wistow, 1991; Persson et al., 1994;
Serry et al., 1998). Proteins (or more precisely, their
domains) as structural and functional modules are 
multiply adapted by evolutionary processes and 
re-used in a different context. Thus, higher order

Figure 2. Function prediction scheme: zooming in and out. Whether gene, contig or genome: current methods con-
centrate on gene prediction and the annotation of individual genes that are then put into context. Due to our limited 
understanding of the genome, this is only possible by accessing complementary experimental information generated 
among others by proteomics research. Nevertheless, exploitation of genome information provides additional hints. 
This review follows the individual steps. 



functions should be analysed in the biological con- 
text of the organism considered. Unfortunately, the 
functional knowledge of proteins reflected in their 
annotation (Figure 1) is frequently incomplete, 
sometimes erroneous or inconsistent, and often 
only cellular or even phenotypic functions are 
listed. For example, the human glia maturation fac-
tor (P17774) is described as growth factor (by defi-
nition extracellular!) but an in-depth sequence 
analysis revealed ADP-domains characteristic for 
cytoskeletal proteins (intracellular!). 



Sequence and annotation quality in 
molecular databases 

Function transfer by analogy requires knowledge 
about the quality of sequence data and functional 
annotation. Concerns have been raised about an 
accumulation (Bork & Bairoch, 1996) and even an 
explosion (Bhatia et al., 1997) of errors in sequence
databases. 

In genome projects, two to tenfold sequence 
coverage is usually sampled. This is critical as
automated raw data acquisition (single read) is less
than 99% accurate even when using optimized
sequencers and software (Ewing et al., 1998). Most
of the ESTs (expressed sequence tags, i.e. single or 
sometimes double gel reads from cDNA) stored in 
current databases have much lower quality and 
require special caution as they are often also con- 
taminated by cloning vectors or DNA of other 
sources including non-coding regions. More 
reasonable accuracy (at least over 99.85%) in all 
regions can be achieved only by systematic mul- 
tiple coverage (Richterich, 1998). Nevertheless this 
will leave about one error per gene (mostly frame- 
shifts) leading to considerable deviations at the 
protein level. Unless the accuracy is above 99.99% 
(the majority of the reading frames are sequenced 
without any error), a considerable error rate should 
be considered in the analysis. 
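
The arithmetic behind these accuracy thresholds can be checked with a short sketch. It is only an illustration: it assumes that base-calling errors are independent and that a typical gene is about 1000 bases long (the constant GENE_LENGTH is an assumption, not a figure from the text).

```python
# Illustrative only: independent per-base errors and a 1000-base gene are
# assumptions made for this sketch, not values taken from the review.
GENE_LENGTH = 1000


def expected_errors(accuracy, length=GENE_LENGTH):
    """Expected number of wrong bases in a sequence of the given length."""
    return (1.0 - accuracy) * length


def error_free_fraction(accuracy, length=GENE_LENGTH):
    """Probability that every base of the sequence is read correctly."""
    return accuracy ** length


for accuracy in (0.99, 0.9985, 0.9999):
    print(
        f"accuracy {accuracy:.2%}: "
        f"{expected_errors(accuracy):.1f} expected errors per gene, "
        f"{error_free_fraction(accuracy):.1%} of genes error-free"
    )
```

Under these assumptions, 99.85% accuracy corresponds to roughly one to two errors per gene, and only at about 99.99% is the large majority of reading frames free of errors, in line with the figures quoted above.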

Processing of raw genomic DNA includes identi- 
fication of genes, their exon/intron structure, and 
the "in silico" translation into protein sequences by 
automatic methods. Given the limited accuracy of 
eukaryotic gene prediction methods (Burset & 
Guigo, 1996; Guigo, 1997, see below) and the 
impact of organelle- and species-specific translation 
tables, of pre- (RNA editing and splicing) and 
post- (propeptide cleavage and protein splicing, 
side-chain modifications) translational changes, the 
sequence quality of a given genomic segment is 
expected to be lower in protein databases than at 
the DNA level. 

The value of sequences stored in databases is 
greatly increased by their functional annotation. 
However, automatic as well as manual annotations 
have all kinds of inaccuracies ranging from ortho- 
graphic errors, simple spelling ambiguities, and 
incompleteness to semantic mistakes (Bork & 
Bairoch, 1996; Eisenhaber & Bork, 1998; Smith & 
Zhang, 1997). Function assignments obtained as a 
result of automatic homology searches are often 
not labelled as such and cannot easily be distin- 
guished from true experimental data (Bork & 
Bairoch, 1996; Andrade & Sander, 1997). Further- 
more, there is a gap between the current database 
annotation and the knowledge embodied in the 
scientific literature (Bork & Koonin, 1998). 

Creating, updating, and correcting functional 
annotation is a costly effort absorbing a consider- 
able amount of manpower. At the moment, there 
is no real alternative to manual input from experts. 
In the future, text analysis systems might support 
this process by automatically extracting abstracts 
of related articles from literature databases and 
selecting relevant keywords and text units for pro- 
tein families (Guigo et al., 1991; Guigo & Smith,
1993; Andrade & Valencia, 1997). 

For analyses of genotype-phenotype relation- 
ships, the retrieval of complete sets of proteins 
from sequence databases with respect to their func- 
tion is necessary. This can efficiently be achieved 
only by categorized protein function descriptions 
(Riley, 1998) for cellular (subcellular localization, 
involvement in metabolic pathways, signal trans- 



duction cascades, etc.) and phenotypic functions. 
However, functions are currently annotated in the 
form of plain text incorporating a large variety of 
vocabulary for the in-depth description of particu- 
lar phenomena. Thus, they are not easily retrieva- 
ble with keyword search engines such as SRS 
(Etzold et al., 1996).

Computer-readable hierarchical systems of func- 
tion description as envisioned in Figure 1 might be 
helpful, but controlled vocabularies such as in FLY-
BASE (ftp.ebi.ac.uk/pub/databases/edgp/misc/
ashburner/fly_function_tree), the keywords in
SWISS-PROT (expasy.hcuge.ch/sprot/), and, for
catalytic functions, the system of Overbeek et al.
(1997) put enormous pressure on the database 
curators. Such classifications also have to be 
adapted and updated frequently in accordance 
with the increasing understanding of the biological 
relationships. 

Rule-based automatic algorithms that parse writ- 
ten annotations for defined questions might be a 
solution since a much smaller effort (compared 
with database reformatting) is required for their 
updates. For the deduction of cellular localization, 
a system of about 1000 biological rules was able to 
classify 88% of entries of SWISS-PROT (currently 
seen as one of the best annotated general protein 
sequence databases; Bairoch & Apweiler, 1998) 
into subcellular localization categories. This is con- 
siderable progress given that only 22% of the 
entries can be retrieved using querying stems of 
keywords such as "extracell" or "membrane" 
(Eisenhaber & Bork, 1998). 
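
The idea of parsing free-text annotation with biological rules can be sketched in a few lines. The rules below are hypothetical and far simpler than the roughly 1000 rules mentioned above; the category names and patterns are invented for illustration and are not those of Eisenhaber & Bork.

```python
import re
from typing import List

# Hypothetical rules: (pattern on the annotation text, localization category).
# A real rule set would be much larger and would handle negations and context.
RULES = [
    (re.compile(r"extracell|secreted", re.IGNORECASE), "extracellular"),
    (re.compile(r"membrane", re.IGNORECASE), "membrane"),
    (re.compile(r"nucle(ar|us)", re.IGNORECASE), "nuclear"),
    (re.compile(r"cytoplasm|cytosol", re.IGNORECASE), "cytoplasmic"),
]


def classify_localization(annotation: str) -> List[str]:
    """Return every category whose rule matches the annotation text."""
    return [category for pattern, category in RULES if pattern.search(annotation)]


for text in ("Integral membrane protein", "Secreted growth factor", "DNA polymerase"):
    print(text, "->", classify_localization(text) or ["unclassified"])
```

Adding or correcting a rule changes only the rule list, which is why such parsers are cheaper to maintain than a reformatted database.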

Annotating genomes 

Function prediction usually starts with already 
assembled genomic or cDNA data: at best a com- 
plete genome (Figure 2). Several features intrinsic 
to DNA can be recognized first, before identifi- 
cation of genes and pathways, although detection 
of the latter enhances also the annotation of non- 
coding features in genomes. 

Nucleotide frequencies 

Nucleotide frequencies are one of the oldest fea- 
tures of genomes that have been studied, even 
before sequencing was available (Chargaff & 
Davidson, 1955). Biases in nucleotide frequencies 
exist both within and between genomes; they have
various uses in gene and function prediction. In
warm blooded vertebrates and angiosperms, for
example, the genome is divided in regions, so
called isochores, that differ in G+C content. Iso-
chores with a high G+C content are relatively rich
in genes (Saccone et al., 1996). Biases in G+C con-
tent can hence be used to find genes. A number of
bacterial species show biases in the nucleotide fre-
quencies of the leading and lagging strands in
replication (Mrazek & Karlin, 1998; Freeman et al.,
1998); these biases can be correlated with a bias in
the coding density, e.g. in Bacillus subtilis (Kunst
et al., 1997).
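
As an illustration of how such biases are measured, the sketch below computes G+C content and the strand asymmetry (G-C)/(G+C) in windows along a sequence. The window size, step and toy sequence are arbitrary assumptions; real analyses use windows of thousands of bases.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_skew(seq):
    """Strand asymmetry (G - C) / (G + C); 0.0 if the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windows(seq, size, step):
    """Yield (start, subsequence) for each full window along the sequence."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# Toy sequence (invented): an A+T-rich stretch followed by a G+C-rich one.
sequence = "ATATTAATATTAAT" + "GGCGGCCGGGCGCG"
for start, window in windows(sequence, size=14, step=7):
    print(start, f"GC={gc_content(window):.2f}", f"skew={gc_skew(window):+.2f}")
```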

In the study of complete genomes, biases in 
nucleotide frequencies and codon usage provide an 
important clue for detecting recent horizontal 
transfers of genes into the genome (Figure 3; 
Medigue et al., 1991). Variations in the codon
usage can be described with a principal component
analysis which divides the variation among
orthogonal axes. Different axes correspond to inde-
pendent sources of variation; the variation in
codon usage that results specifically from horizon-
tal gene transfer can be identified using auxiliary
functional information from the genes (see below).
In Helicobacter pylori for example, it is the first prin-
cipal component that reflects horizontal gene trans-
fer (Figure 3). On the basis of the variation in the
codon usage in E. coli its genome has been pre-
dicted to consist of at least 10-15% of recently 
horizontally transferred genes (Medigue et al., 
1991). Biases of nucleotide frequencies within a 
genome also reveal information about their func- 
tion apart from information about the evolutionary 
history of genes. Recently horizontally transferred 
genes are expected not to be involved in the core 
functions of the cell and to be relatively expend- 
able (they were generally not present before the 
transfer). In H. pylori the regions with deviating 
nucleotide frequencies can be related to pathogen- 
icity, or are prophages and/or are rich in insertion
sequences (Figure 3). The same observation has
been made in Haemophilus influenzae (Fleischmann
et al., 1995; Huynen et al., 1997).
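
A principal component analysis of codon usage can be sketched without any external library. The example below is a minimal illustration, not the procedure of the cited studies: it builds a table of relative codon frequencies per gene, centres each column, and finds the first principal axis by power iteration (in degenerate cases power iteration may converge slowly or to another axis). The four toy genes are invented.

```python
from itertools import product
from math import sqrt

# All 64 codons in a fixed order, used as the columns of the usage table.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_principal_component(rows, iterations=200):
    """Return (axis, scores) of the first principal component of the table."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Start from the centred row with the largest norm; a uniform start
    # vector would fail because centred frequency rows sum to zero.
    start = max(centred, key=lambda r: dot(r, r))
    norm = sqrt(dot(start, start))
    if norm == 0.0:  # all genes have identical codon usage
        return [0.0] * m, [0.0] * n
    axis = [x / norm for x in start]
    for _ in range(iterations):
        scores = [dot(r, axis) for r in centred]
        # One power-iteration step on X^T X, computed as X^T (X axis).
        new = [sum(centred[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sqrt(dot(new, new))
        if norm == 0.0:
            break
        axis = [x / norm for x in new]
    return axis, [dot(r, axis) for r in centred]


# Toy data (invented): three G+C-rich genes and one A+T-rich gene.
genes = ["GCCGCGGGC" * 10, "GCGGCCGGC" * 10, "GGCGCCGCG" * 10, "AAATTTATA" * 10]
axis, scores = first_principal_component([codon_frequencies(g) for g in genes])
outlier = max(range(len(genes)), key=lambda i: abs(scores[i]))
print("gene with the most deviating codon usage:", outlier)
```

Genes with extreme scores on the first axis are candidates for atypical codon usage; whether that axis reflects horizontal transfer has to be judged from auxiliary functional information, as described above.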



Repeats 

For a large fraction of the DNA of multicellular 
eukaryotes no obvious function has yet been 
assigned. Most of it consists of repetitive elements. 
For example, Alu repeats may cover as much as 
13% of the human genome (Mighell et al., 1997).
Repetitive, non-coding DNA should be filtered out 
as one of the first steps in function prediction to 
reduce the search space for the finding of genes in 
eukaryotic DNA (Jurka et al., 1996). Coding regions
contain repeats too, but these are hardly identifi- 
able at the DNA level due to their divergence. 
They usually represent structural domains and 
should be detected at the protein level (see below). 
An exception are the trinucleotide repeats that are 
expanded in a number of disease genes (Chastian 
& Sinden, 1998); they can even specifically be used 
to search for such genes in DNA libraries (Pujana 
et al., 1998).

In prokaryotes repeats are much less frequent. 
However, tetranucleotide repeats have been found 
in some virulence genes that increase variability by 
frameshift mutations (Hood et al., 1996). More
strikingly, repetitive elements even have been 
found in what are probably the smallest bacterial 
genomes, those of mycoplasmas. These have been 
hotspots for genome rearrangements via recombi-
nation, as can be deduced by whole genome com-
parison (Himmelreich et al., 1997).

Regulatory regions 

Regulatory regions can indicate when and how 
genes are expressed, repressed or co-expressed. 
Their computational detection is a powerful comp- 
lement to novel experimental approaches (see pro- 
teomics, below). If known structures provide a 
template, simple consensus searches, matrix 
approaches and also programs taking into account 
specific features, structural constraints and energy 
values are available (reviewed by Dandekar & 
Scharma, 1998). 
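
A matrix approach of the kind mentioned above can be sketched as a log-odds weight matrix built from aligned example sites and slid along a sequence. The example sites, the uniform background of 0.25 per base and the pseudocount are assumptions chosen for illustration, not parameters from the cited programs.

```python
from math import log2


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log-odds score per position and base from equally long aligned sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(seq, matrix):
    """Score every window of the sequence; unknown bases contribute 0."""
    width = len(matrix)
    return [
        (i, sum(matrix[j].get(seq[i + j], 0.0) for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


def best_hit(seq, matrix):
    """Return (position, score) of the highest-scoring window."""
    return max(scan(seq, matrix), key=lambda hit: hit[1])


# Hypothetical aligned sites and target sequence, for illustration only.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
matrix = build_weight_matrix(sites)
position, score = best_hit("GGCGTATAATGCGC", matrix)
print(f"best match at position {position} with score {score:.2f}")
```

Unlike a plain consensus search, the matrix tolerates mismatches at weakly conserved positions while still penalizing them.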

If no genomic template structures are available, 
neural networks (Demeler & Zhou, 1991; Pedersen 
& Engelbrecht, 1995; Ogura et al., 1997), language
based approaches (Trifonov, 1996) and other non- 
consensus search methods are important (e.g. 
Tiwari et al., 1997). One can, for example, search
for so-called CpG islands, which are, relative to the 
rest of the genome, abundant in the regulatory 
regions of mammalian housekeeping genes 
(Wirkner et al., 1998). The combination of artificial
in vitro evolution and genomic screening is another 
powerful way to identify a regulatory motif when 
no template structure or sequence is available. The 
computer based genomic screen delineates how 
close the in vitro selection procedure comes to the 
situation in vivo (Dandekar et al., 1998a).
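
A search for CpG-rich regions can be sketched by comparing the observed number of CG dinucleotides with the number expected from the base composition. The thresholds used below (G+C above 50%, observed/expected above 0.6) are commonly quoted values taken here as assumptions; they are not given in the text, and the window length requirement used in practice is omitted.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_observed_expected(seq):
    """Observed CG dinucleotides divided by the number expected by chance."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a single window."""
    return gc_fraction(seq) > min_gc and cpg_observed_expected(seq) > min_ratio


for window in ("CGCGCGCG", "CCCCGGGG", "ATATATAT"):
    print(window, round(cpg_observed_expected(window), 2), looks_like_cpg_island(window))
```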

The challenge from complete genome sequences 
is double: first, a comprehensive annotation of 
known regulatory elements using specific search- 
ing methods (i.e. either templates for particular 
elements such as promotors, attenuators, termin- 
ators and enhancers or RNA secondary structure 
fitting methods; d'Aubenton-Carafa et al., 1990;
Brendel et al., 1986); second, the identification of
novel elements using comparative analysis and 
experimental indications (co-expression, etc.). 
Knowledge of gene expression and changes in 
gene expression patterns at a complete genomic 
level may revolutionize drug discovery processes. 
An overview of the complete genome allows much 
better tailoring of drugs and the discovery of cor- 
rect, condition specific targets (Gelbert & Gregg, 
1997). 

A comparison of complete genomes identifies 
orthologous genes (see below). Their upstream 
regions can be screened for common regulatory 
signals in a much reduced search space. When co- 
expression patterns or functional interactions of 
genes are known, one can also search within the 
non-coding regions of a single genome. Unfortu- 
nately, regulatory regions in prokaryotes seem to 
be little conserved (Figure 4; Diaz-Lazcoz et al.,
unpublished), thus it is necessary to include several 
species to increase the signal to noise ratio via mul- 
tiple alignments. In the case of putative RNA struc- 
tures, one can utilize methods that include base- 
pairing information (cf. Chan et al., 1990; Han &
Kim, 1993). These approaches, however, require

[Figure 3 legend: genes with a homolog in the E. coli genome or in a non-pathogenic E. coli plasmid; genes (putatively) involved in interaction with the host; genes with a homolog in H. influenzae but not in E. coli; others, e.g. metabolism, transposases, restriction enzymes; genes with unknown functions; I-VI, clusters of genes with deviating codon usage; axis: codon usage]

Figure 3. Differential genome display of H. pylori versus E. coli, H. influenzae. The genes of H. pylori are divided into
sets. Set 1 (green) are genes with a homologue in E. coli, set 2 (HI), genes with a homologue in H. influenzae but not in 
E. coli. Set 3 (red) are genes without a homologue in E. coli that are (putatively) involved in interaction with the host
like virulence factors, outer membrane proteins and toxins. Set 4 (purple) are genes without a homologue in E. coli or 
H. influenzae that are not host interaction factors. A large fraction (63%) of the genes in H. pylori that have no homol- 
ogue in E. coli, but for which some functional classification is possible, can be considered host interaction factors. The 
star-figure in the centre gives the values of the codon usage of the genes on the first principal component in distance to
the centre. The first principal component corresponds roughly to the usage of A and to a lesser extent T in 3rd codon
positions. Hence, genes with a high value on this axis have a relatively high A+T content in third coding positions. Six 
clusters (I-VI) of at least three consecutive genes with, on average, the codon usage that deviates the most from the 
genomic mean were further analysed. The genes in these tend not to have any homologue in E. coli or H. influenzae, 
their closest relatives for which complete genome sequences are available. This observation supports that the genes in 
clusters I-VI result from horizontal gene transfer into the genome. Proteins from I and VI are hypothetical proteins 
with no known homologues other than proteins in H. pylori itself. Region II contains homologues of VirB4, a virulence 
factor and of transposases. Region III is the CAG pathogenicity island, whereas region V again is rich in transposases. 
Region IV consists of three proteins, HP0611-HP0613. Sequence analysis reveals a frameshift that would merge HP0611 
with HP0612. The resulting protein is an ABC type 2 transporter, the only one that can be observed in H. pylori. ABC-2
transporters are involved in export of complex carbohydrates and play an important role in virulence.

Figure 4. Variability of regulat-
ory regions. Structure of the str
operon and surrounding genes in 
eight Bacteria and two Archaea. 
The organisation of the str operon 
in Archaeoglobus fulgidus is essen- 
tially the same as in M. jannaschii, 
while the structure of the str 
operon in Mycoplasma pneumoniae is 
the same as that in M. genitalium. 
The arrows indicate the direction of 
transcription, the numbers under 
the arrows the lengths of the inter- 
genic regions. Genes in green are 
orthologous to the genes in the 
same location in E. coli. Genes in 
light-red are shared at that location 
between two or more genomes 
other than E. coli. Gene names may 
vary in the official genome annota- 
tions, but were kept constant in 
this Figure for clarity purposes. 
Operon structure is generally not 
well conserved in prokaryotic evol- 
ution. The conservation of two 
genes besides each other in all the 
prokaryotic genomes that have 
been sequenced thus far can be 
regarded as an exception. Note that 
the conservation of the str operon 
does not follow the standard phy- 
logenetic pattern: i.e. the operon 
structure in E. coli is more similar 
to that in the Archaea than it is to 
the operon structure in the more closely
related bacteria B. burgdorferi and
A. aeolicus. In E. coli expression of 
the str operon is regulated by an 
RNA secondary structure located 
between the rpsL and rpsG genes 
(Saito & Nomura, 1994). A similar 
structure is present in H. influenzae, 
but is absent from the other 
species. Hence, regulatory elements 
appear even less conserved than 
gene order. 



longer and relatively strong signals. Weaker motifs 
can be identified using statistical approaches with- 
out prior alignment of sequences (cf. Staden, 1989; 
Hertz et al, 1990; Wolfertstetter et al, 1996). 
Reliable statistics require, however, many orthologous
sequences. The comparison based on orthologous
regions in complete genomes, from different
Gram-negative bacteria for instance, offers a new
way to identify regulatory motifs without a precon- 
ception of the regulatory motifs revealed. In due 
course this and other approaches (see above) will 
improve quantitative predictions on expression and 
regulation in complete genomes and should also 
yield probabilities for tissue distribution of 
expression patterns and regulatory factors. 



Gene prediction 

The prediction of protein coding genes from 
DNA sequences can become a major bottleneck in 
genomics as currently there is quite a lot of infor- 
mation loss when genes cannot be identified cor- 
rectly. In eukaryotes the situation is particularly 
complicated, due to the generally low coding den- 
sity (probably as low as 2% in human) and the pre- 
sence of introns surrounding the relatively short 
coding regions. Various different, but weak signals
have to be combined, such as promoters, splice
sites, translational start and stop sites; different
knowledge-based methods complemented by homology
searches are applied to utilize them (Guigo,
1997). However, an analysis of the accuracy of all
available packages for the prediction of coding
sequences for a region of human DNA showed
a low accuracy for the prediction of coding
sequences and specifically the prediction of intron/
exon boundaries (Burset & Guigo, 1996; Gelfand
et al., 1996; Guigo, 1997; Lukashin & Borodovsky,
1998).

In Archaea and Bacteria the situation is, relative 
to most eukaryotes, less complicated due to the 
almost complete absence of introns. In predicting 
protein coding regions, several methods make use 
of the information in the complete genome 
(Borodovsky et al, 1994; Fraser et al, 1998). Such 
bootstrapping methods use the genes that can
easily be predicted, e.g. on the basis of the length
of open reading frames and/or similarities to
genes from other species, to establish (i) taxon-specific
patterns in codon usage, hexanucleotide
frequencies and local complexity (information
content), and (ii) taxon-specific signal sequences like
poly(A) signals, regulatory sequences such as
ribosome-binding segments (Shine-Dalgarno segments)
and promoters and start codons, etc. These patterns
are currently implemented into Hidden
Markov Models (HMMs) to predict the other genes
in a genome. A method that relies on no a priori 
information to divide the genome into coding and 
non-coding regions has been shown to be success- 
ful (Audic & Claverie, 1998). These gene finding 
approaches should be complemented by another 
round of homology searches to find shorter or fra- 
meshifted genes that do not follow the codon 
usage of the organism. 
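
The first bootstrapping step described above, collecting genes that are
easy to predict from the length of their open reading frames, can be
sketched as follows. This is only a minimal illustration under stated
assumptions: the function name find_orfs and the parameter min_codons are
hypothetical, only the forward strand is scanned, and ATG is taken as the
sole start codon. It is not the procedure of any of the cited packages.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 100) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon; `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop (assumed threshold).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

Long ORFs found in this way could then serve as the training set from which
codon usage and hexanucleotide frequencies are estimated; a real gene finder
would also scan the reverse strand and alternative start codons.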

Annotating individual proteins 

Although homology searches are often already 
integrated into the gene prediction procedures, 
they are fully exploited only at the protein level 
with its higher sensitivity. Database searches are a 
standard technique for annotating proteins, but 
should be used in context with other methods 
(Bork & Koonin, 1998). 

Domain analysis 

Due to the modularity of many proteins, i.e. 
their multidomain architecture, the first step in 
functional annotation should be a scan for known 
domains in a query protein. Several databases exist 
that comprise patterns or profiles, i.e. fingerprints 
of already classified domains, and are well-suited 
for this first scan. Although somewhat redundant, 
they each have their individual strengths. PROSITE 
(Bairoch et al, 1997) is one of the oldest and prob- 
ably most widely used. It is well-annotated and 
covers more than 1000 different domains. With the 
inclusion of PROFILESCAN there is now also 
access to more than 250 domains that cannot easily 
be described with the classical PROSITE consensus 
string. A drawback, perhaps, is that the profiles 
are not yet fully integrated and most of them are 



not exhaustively annotated (which is a huge 
amount of work). BLOCKS (Henikoff et al, 1998) is 
derived from PROSITE and offers ungapped align- 
ments that are, in turn, used for a pattern matching 
approach which is more sensitive than the consen- 
sus string matching method of the original PRO- 
SITE database. 

PRINTS describes a protein domain with a set of 
several motifs separated along the sequence 
(Attwood et al, 1998). Version 17.0 is extensively 
annotated and comprises about 800 fingerprints 
with in total about 4500 motifs. 

PFAM contains a collection of accessible multiple 
alignments that are translated into hidden Markov 
models; version 2.1 is a large collection covering 
527 families that match at least once 47% of all 
SWISS-PROT entries in release 34 (Sonnhammer 
et al, 1998); it is more sensitive than the classical
PROSITE or PRINTS, has poorer annotation, but
has many entries crosslinked to other domain data- 
bases. SMART concentrates only on mobile 
domains and hence is not exhaustive, but has a 
high sensitivity and selectivity, takes care of 
domain borders and provides additional annota- 
tion features (Schultz et al, 1998). Each of the data- 
bases offers search software on the web and there 
are efforts under way to overcome the difficulties 
of different formats and annotation styles. 

Intrinsic feature analysis 

Current database search techniques are all ham- 
pered by compositionally biased (low complexity) 
regions with a reduced residue alphabet. This 
includes (1) transmembrane regions: accumulations 
of ten hydrophobic residues in segments of length 
20 of non-homologous transmembrane proteins are 
treated as homologous, (2) coiled coil segments 
(widespread heptarepeats with patterns of hydro- 
phobic and polar residues) that pollute database 
search outputs with high scoring similarities to 
analogous (but probably not homologous) coiled 
coil regions in other proteins, (3) small repeats that 
lead to a bias in amino acid composition and (4)
other regions with biases towards one or several 
amino acids such as proline-rich or glutamine-rich 
regions. 

Methods exist for the identification of all those 
features. Most of these use many sequences with
the feature as training sets and identify the feature
in a knowledge-based manner. A general method for finding
low complexity regions, SEG (Wootton & 
Federhen, 1996) is already integrated into BLAST 
(Altschul et al, 1997) in the form of a filter. Special 
types of composition bias can, of course, be predicted
better by specialised methods such as coiled
coil predictors (for a review see Lupas, 1997), 
transmembrane helix recognition (e.g. TOPPRED2, 
von Heijne, 1992) or even for subclasses of those 
such as signal sequences (e.g. SIGNALP, Nielsen 
et al, 1997). For transmembrane regions, a variety 
of methods exists with widely varying outputs. It 
is worrying that when using different methods for 
genome analysis the results vary greatly, e.g. the
fraction of transmembrane proteins in Mycoplasma
genitalium was predicted to be 18% (Fischer &
Eisenberg, 1997), 24% (Koonin et al, 1997), 30%
(Arkin et al, 1997) and 36% (Frishman & Mewes, 
1997). 

To avoid spurious hits and thus erroneous trans- 
fer of functional information, such regions can be 
filtered out, for example using the SEG option in
BLAST (Altschul et al, 1997; replaced by "neutral" 
Xes for "any amino acid"). One has to bear in 
mind though that also such residues can contain 
useful functional and structural information and 
need to be annotated. 
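
The idea of replacing compositionally biased stretches by "neutral" Xes can
be illustrated with a small sketch. It is not the SEG algorithm; it simply
masks every window whose Shannon entropy falls below a cutoff. The window
length of 12 and the threshold of 2.2 bits are illustrative assumptions, and
the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace all residues of every low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)
```

Because whole windows are masked, the X run usually extends a few residues
beyond the biased segment itself. The unmasked sequence should be kept
alongside the masked one, since the masked residues may still carry
functional information that needs to be annotated.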

Another functional feature, especially in eukary- 
otic proteins, is the presence of posttranslational 
modifications. The EXPASY server provides useful 
software tools to detect and describe these 
(www.expasy.ch/www/tools.html). 

Homology analysis 

A classical database analysis should only be per- 
formed after identification and masking of 
domains and intrinsic features described above. 
This has the advantage of search space reduction 
and of better annotation quality. A database search 
using BLAST (Altschul et al., 1997) or FASTA 
(Pearson, 1998) often reveals significant homol- 
ogues, but this is then only the beginning of a com- 
plicated, and mostly manual transfer of functional 
information from the homologue in the database to 
the query sequence as one does not know how 
many of the functional features are shared (Doerks 
et al., 1998). Averaged over all species, the chance 
that a newly sequenced gene has a homologue in 
sequence databases detectable by BLAST is already 
above 70% (e.g. 84% for yeast chromosome III; 
Bork & Koonin, 1998; 70-85% for Bacteria and 
73% for Archaea; Koonin et al, 1997, but lower for 
animals), while the fraction for which some func- 
tional features can be predicted is at least 70% in 
Archaea and Bacteria (Koonin et al, 1997). For 
more than a third of all bacterial proteins, some 
homology-based fold assignments can be done 
with high confidence (Huynen et al, 1998b; M.A.H. 
et al, unpublished). Knowing the 3D structure of a 
protein is crucial in the understanding of the 
relation between sequence and function. In the 
case that amino acid identity levels to sequences 
with known 3D structures are higher than 50%, 
homology modelling can be used to further eluci- 
date the roles and interactions of individual amino 
acids (Johnsson et al, 1994; Eisenhaber et al, 1995; 
Sanchez & Sali, 1997; Rodriguez & Vriend, 1997). 
Other predicted structural features such as second- 
ary structure elements (Rost & O'Donoghue, 1997) 
can also be used in functional characterization. 
Characterization of a potential protein or RNA sec- 
ondary structure can help to assess whether an 
open reading frame codes for a protein or a 
sequence codes for a functional RNA structure 
(Huynen et al, 1996), respectively, or to test
hypotheses based on other, independent observations.

Only in the minority of cases can functional and 
structural features of a homologue be transferred 
to the query sequence as is (see above, Figures 1, 2) 
because often only some of the features are shared. 
Functional equivalence is only likely for ortholo- 
gues. 

Finding orthologues 

Orthologues (Figure 2, top right) are genes 
whose independent evolution reflects a speciation 
event rather than a gene duplication event (Fitch, 
1970). They are likely to perform the same function 
in various species, and hence represent a refine- 
ment over homologues in sequence analysis and 
annotation. Knowledge of the complete genome 
and of its protein coding regions improves the 
detection of orthologues. Orthologues are expected 
to have the highest level of pairwise similarity 
between all the genes in two genomes (Tatusov 
et al, 1996, 1997; Huynen & Bork, 1998), having 
diverged relatively recently compared to non- 
orthologous homologues. One needs to know all 
the proteins in two genomes to use relative levels 
of sequence identity to identify orthologues. 
Methods for the finding of orthologues rely both 
on relative similarity of genes from various gen- 
omes, and on information from the context of a 
gene in a genome. If two genes from different gen- 
omes share the same context, e.g. in the form of 
being a neighbour to a gene that also has the high- 
est pairwise similarity between the two genomes, 
this supports them being orthologues of each 
other. The comparison of the sequence tree and the 
species tree can help in identifying orthologues 
(Yuan et al, 1998), assuming that the genes have 
not been subject to horizontal transfer. Apart from 
information about the "functions" present in the 
genome, orthologues also provide information 
about the evolution of gene regulation. Specifically 
by comparing the 5' and 3' regions of orthologous 
genes one can obtain information about the evolution
of promoters and operator/repressor
sequences, and about the evolution of RNA sec- 
ondary structures involved in gene regulation (see 
above). Orthologues should be the basis of sub- 
sequent reconstruction of pathways, rather than 
proteins for which we only know that they are 
homologous. Within the current databases, only a 
minor fraction of homologous relations can be 
classified as orthologous and thus one has to incor- 
porate external data (Figure 2, left) for further func- 
tion characterization. 
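
The criterion that orthologues have the highest pairwise similarity between
all the genes in two genomes is often put into practice as a bidirectional
best hit test. The sketch below assumes that all-against-all similarity
scores are already available as a dictionary keyed by (gene in genome A,
gene in genome B); the gene labels, the score values and the function names
are hypothetical, and genomic context is not considered.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores, from_first: bool) -> Dict[str, str]:
    """Map each gene to its highest-scoring partner in the other genome."""
    best: Dict[str, Tuple[str, float]] = {}
    for (gene_a, gene_b), score in scores.items():
        query, target = (gene_a, gene_b) if from_first else (gene_b, gene_a)
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(scores, from_first=True)
    b_to_a = best_hits(scores, from_first=False)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Toy scores: a2 is most similar to b1, but b1 prefers a1, so a2 is
# treated as a paralogue rather than an orthologue.
toy_scores = {
    ("a1", "b1"): 90.0, ("a1", "b2"): 40.0,
    ("a2", "b1"): 60.0, ("a2", "b2"): 35.0,
    ("a3", "b2"): 50.0,
}
```

The test requires the proteins of both genomes to be known, which is why
complete genome sequences improve the detection of orthologues. Candidates
found this way can still be misleading after horizontal transfer or gene
loss.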

Searching genes for a function 

A tool that further exploits the information from 
comparing genomes for function prediction is 
differential genome analysis (Huynen et al, 1997, 
1998a). The genes that are not shared between two 
genomes are probably responsible for species-specific
phenotypes, as can be shown in the comparison
of the pathogenic H. influenzae with the
closely related but relatively benign E. coli. A large 
fraction (70%) of the genes in H. influenzae for 
which there are no homologues in E. coli and for 
which some functional annotation is possible can 
indeed be considered host interaction factors 
(Huynen et al, 1997). Also in the pathogen 
H. pylori the fraction of genes that is not shared 
with E. coli is relatively enriched in host interaction 
factors (Figure 3). Taking differential genome anal- 
ysis one step further one can show how gene con- 
tent correlates with phenotype in multiple genome 
comparisons (Huynen et al, 1998a). Although the
correlations between gene content and phenotype 
cannot be used to predict the function of specific 
genes, they can serve as a filter to select genes that 
are probably responsible for specific functions. Or, 
in other words, to search for "genes for a function" 
rather than to search for "functions for a gene". 
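
In its simplest form, differential genome analysis is a set operation
followed by a count. The sketch below assumes that a homology search has
already been run and that its result is given as a dictionary from each
query gene to its homologues in the reference genome. All gene identifiers,
category labels and function names are hypothetical.

```python
from typing import Dict, Iterable, List


def genes_without_homologue(query_genes: Iterable[str],
                            homologues: Dict[str, List[str]]) -> List[str]:
    """Query genes with no homologue listed in the reference genome."""
    return sorted(g for g in query_genes if not homologues.get(g))


def fraction_in_category(genes: List[str],
                         annotation: Dict[str, str],
                         category: str) -> float:
    """Fraction of the annotated genes in `genes` that belong to `category`."""
    annotated = [g for g in genes if g in annotation]
    if not annotated:
        return 0.0
    return sum(annotation[g] == category for g in annotated) / len(annotated)


# Toy input: g2, g3 and g5 have no homologue in the reference genome.
query = ["g1", "g2", "g3", "g4", "g5"]
hits = {"g1": ["r1"], "g2": [], "g4": ["r7", "r9"]}
labels = {"g1": "metabolism", "g2": "host interaction",
          "g3": "host interaction", "g5": "transport"}
unique = genes_without_homologue(query, hits)
share = fraction_in_category(unique, labels, "host interaction")
```

Only genes with some functional annotation enter the fraction, in line with
the restriction made in the text. The result is a filter for candidate genes,
not a function prediction for any single gene.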

Incorporating proteomics data 

Proteomics focuses on the protein products of 
the genome and their interactions rather than on 
DNA sequences (Humphery-Smith & Blackstock, 
1997). It is thus complementary to the genomic and 
nucleic acid information (Kahn, 1995) exploiting 
novel tools such as 2D large scale analysis (Vietor 
& Huber, 1997) and powerful mass-spectrometry 
applications (Yates, 1998). 

Protein identification and gene expression 

Protein reading frames and expression behaviour 
in particular are not easy to predict from the gen- 
ome sequence and profit from incorporation of 
additional experimental data (Figure 2, left). Co- 
expression as well as tissue- and organ-specific 
expression patterns at genomic scale are inten- 
sively studied (Hieter & Boguski, 1997; Zhang et al, 
1997) and recent techniques collect data on a geno- 
mic level. 

Expressed sequence tag (EST) databases are 
available which contain information on gene 
expression that should correlate with the amount 
of redundancy, and on the tissue distribution of 
mRNA which can yield complex expression pat- 
terns (Boguski et al, 1994; Zweiger & Scott, 1997). 
However, retrieval of this information is hampered 
by the high sequence error rate, by different spli- 
cing variants and by the often missing 5' region 
necessary to determine the exact CDS start. 
Another caveat is that the EST approach has diffi- 
culties measuring genes with low expression. 

Serial analysis of gene-expression (SAGE, 
Velculescu et al, 1995) is a more rapid method to 
obtain partial sequence information from a very 
large set of expressed genes, e.g. differences in 
gene expression profiles in normal and cancer cells 
are identified by hundreds of differentially 
expressed transcripts, many of them growth factors 
(Zhang et al, 1997). DNA chip-based gene- 



expression screening procedures are currently the 
fastest approach. Polymorphism with single base 
resolution is detected within minutes in the entire 
human mitochondrial genome (16.6-kilobases) by 
applying 135,000 probes simultaneously (array 
generated by light-directed chemical synthesis) and 
a two-colour fluorescent labelling scheme (Chee 
et al, 1996). Systematic PCR of the entire yeast 
genome allows fluorescent readout of mRNA 
levels in different yeast environmental conditions 
such as changing glucose concentrations (DeRisi 
et al, 1997; http://cmgm.stanford.edu/pbrown/explore).
The correlation between mRNA and protein
expression level is, however, debatable.
Anderson & Seilhamer (1997) give a correlation 
coefficient of 0.48 for expression levels in human liver
measured either by two-dimensional electrophor- 
esis (protein abundances) or by transcript image 
methodology (mRNA abundance measured by 
cDNA sequencing and cDNA clone count). 

Direct determination of the major expressed pro- 
teins may thus be an independent and attractive 
alternative. The huge amount of work involved in 
this can today be substantially reduced by apply- 
ing 2D gels and mass spectrometry and comparing 
experimental data to the annotated and predicted 
genome sequence. Link et al (1997) identify the 
major part of the proteins and protein complexes 
from H. influenzae (300 out of 400 spots) after liquid
chromatography (LC) and separation of the protein 
cleavage products of each 2D gel spot in a first 
mass spectrograph (MS) and further analysis in a 
second (LC/MS/MS approach). Several proteins 
not annotated in the genome sequence were ident- 
ified by this approach. 

Posttranslational modifications 

After translation many proteins are further pro- 
cessed. This includes chemical modification of 
amino acids. Over 200 amino acid modification 
types are classified (Krishna & Wold, 1997), many 
more are expected (Annan & Carr, 1997). Such 
modifications are not apparent from the genome 
sequence; however, they are often critical for protein
function. Two-dimensional gel electrophoresis
coupled to mass spectrometry and modern soft- 
ware allows not only peptide mass fingerprinting 
for low quantities (Kuster & Mann, 1998) but also 
specific detection of amino acid modifications on a 
large scale (Dongre et al, 1997). For this, a database 
has to cover many of the reading frames likely to 
be encountered in the protein mixture analysed by 
the mass spectrometer. The EXPASY server 
(www.expasy.ch/www/tools.html) comprehen- 
sively links 2D gel experiments (e.g. separation 
from pH 4.0 to above 8.0 in the first dimension 
and from Mr 8-200 kDa in the second) to computer
analysis tools. Nevertheless, determination of e.g. 
sugar modifications both by experiment and by 
software (e.g. EXPASY suite above) has limited 
accuracy, even including the kind of carbohydrate 
attached. 
Figure 5. Protein interactions.
Shown are ABC transporter pro- 
teins in the membrane of Gram- 
negative bacteria. Encoding genes 
are found as conserved gene clus- 
ters in the same sequential order in 
the complete genome sequences of 
E. coli, H. influenzae and H. pylori. 
They are an example of protein 
interactions predicted by triple 
comparison of complete genomes 
and additional confirmation by 
standard methods (see the text). 
A number of other protein inter- 
actions can also be suggested by 
comparative analysis of complete 
genomes. 



Predicting function in higher order processes 

Having predicted or determined functions for as 
many genes as possible and having assigned their 
interactions as well as their expression levels, it is a 
challenging task to put all the information into the 
context of cellular processes (Figure 1). A variety of 
databases and tools are emerging to support this 
procedure. 

Information on tissue distribution 

On the molecular level the processing machinery 
for metabolites differs in diverse tissues including 
absence of enzymes, receptors and structural pro- 
teins. On higher levels such as organ function, 
clinical impairment, drug metabolism or suscepti- 
bility to infections, tissue and phenotypic specific 
expression differences are key features of differentiation
and help to find substances of therapeutic
value. Data for humans are provided e.g. by 
TIGR (www.tigr.org/tdb/hgi/hgi.html), by NCBI 
(www.ncbi.nlm.nih.gov/UniGene/index.html and 
www.ncbi.nih.gov/dbEST/index.html), by SANBI- 
South African National Bioinformatics Institute 
(www.sanbi.ac.za/Dbases.html) and by the MRC 
human genetics unit (glengoyne.hgu.mrc.ac.uk). 
Such data should be used critically as low 
expression transcripts important for regulation 
such as tyrosine kinases may escape detection by 
EST sequencing or even Northern blots, and hence 
are misrepresented in databases. Techniques are 
still being improved (e.g. DNA chips (Brown, 
1994)) and many data are not yet on the Web 
or are even completely inaccessible (e.g. in 
companies). 

Analysis of protein interactions 

Prediction and analysis of protein interaction
uses both experimental (e.g. antibody precipitation,
Maniatis et al., 1989) and theoretical approaches.
Two hybrid screening systems (Tsukamoto et al.,
1997) allow large scale screening, e.g. for 20 residue
peptide sequences that correctly recognize (so
called "aptamers") and inhibit cyclin-dependent
kinase 2 (Colas et al., 1997). Automation (with a
considerable error rate though) and matching the
data gathered with context and information such
as common pathways is possible (Brent & Finley,
1997). Logical connections of protein interactions
(e.g. with ras protein) can be revealed by a careful
choice of reporter plasmids (Xu et al., 1997).

A new way to identify protein interactions, comparative
analysis between genomes, has revealed
that the conservation of gene order between genomes
with less than 50% protein identity is limited
to those genes that code for proteins that physically
interact with each other (Dandekar et al.,
1998b). Protein candidates for physical interaction
that are identified by the conservation of their gene
order can further be analysed by the methods mentioned
above. An example is the ABC transporters,
which were experimentally shown to consist
of physically interacting proteins (Eym et al., 1996)
and are found in conserved gene clusters in different
genomes (Figure 5). The conservation of gene
order can of course be used for the prediction of
functional features of hypothetical proteins (interaction
with a neighbour and, if this one is characterized,
even participation in a pathway).
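
A search for conserved gene order can be sketched as an intersection of
neighbour pairs. The example assumes that each genome is given as a list of
orthologue labels in chromosomal order, so that the same label in two
genomes means "orthologous gene". The labels and function names are
hypothetical, and strand, circularity and gaps between genes are ignored.

```python
from typing import FrozenSet, List, Set, Tuple


def adjacent_pairs(gene_order: List[str]) -> Set[FrozenSet[str]]:
    """Unordered pairs of directly neighbouring genes in one genome."""
    return {
        frozenset((left, right))
        for left, right in zip(gene_order, gene_order[1:])
        if left != right
    }


def conserved_neighbours(genomes: List[List[str]]) -> List[Tuple[str, ...]]:
    """Gene pairs that are neighbours in every genome (needs >= 1 genome)."""
    shared = set.intersection(*(adjacent_pairs(g) for g in genomes))
    return sorted(tuple(sorted(pair)) for pair in shared)


# Toy gene orders for three genomes; the abc cluster is kept together
# in all three, once in inverted orientation.
genome_1 = ["abcA", "abcB", "abcC", "recA", "gyrB"]
genome_2 = ["gyrB", "abcA", "abcB", "abcC", "dnaK"]
genome_3 = ["dnaK", "abcC", "abcB", "abcA", "recA"]
pairs = conserved_neighbours([genome_1, genome_2, genome_3])
```

Pairs that survive the intersection across distantly related genomes are the
candidates for physical interaction; they would still need confirmation by
the experimental methods mentioned in the text.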

Reconstruction of pathways 

The prediction of reactions and pathways 
(example: Figure 6) of the respective organisms 
integrates all the data above (including errors at 
different levels!) into its phenotypic context and 
yields a more complete picture of the biochemical 
and adaptive capabilities of the sequenced organism
(Overbeek et al., 1996). Mispredictions, wrong
annotations and higher level errors (substrate 
specificity etc.) have to be minimized by context 
information and additional experimental data. Pro- 
blems specific to pathway predictions arise, such 
as non-orthologous displacements (enzymatic 






[Figure 6 diagram: non-phosphorylated Entner-Doudoroff pathway (Thermoplasma acidophilum) at top left; normal glycolysis (E. coli, eukaryotic cells) from glucose to pyruvate in the centre, with phosphofructokinase, aldolase and pyruvate kinase marked; PEP synthetase at bottom right.]

Figure 6. Prediction of metabolic pathways and pathway alignment. The glycolytic pathway (centre) and alterna- 
tive other routes (sides) predicted from the genome and observed in several microorganisms are shown and com- 
pared (pathway alignment) to illustrate the often underestimated variability of metabolic pathways. Key enzymes 
discussed are shown in bold. In the glycolytic pathway (centre) two molecules of triose are derived from one hexose 
(as dihydroxyacetone phosphate can also be converted into glyceraldehyde-3-phosphate), the energy yield is two 
mole ATP per mole glucose. Genome analysis shows that the complete glycolytic pathway is present in E. coli, as it 
is, incidentally, in most eukaryotic organisms and cells including all human cells. In contrast, in H. pylori, a causative
agent for stomach ulcer and chronic ulcerative gastritis, phosphofructokinase in the upper part of the glycolytic path- 
way and the important enzyme pyruvate kinase in the lower part seem to be missing. Thus a different route has to 
be taken in H. pylori. According to our analysis (right bottom), a homologue of phosphoenol-pyruvate (PEP) synthe- 
tase is present, which may support the missing step of the pyruvate kinase albeit at a reduced energy yield. The 
taking over of the role of pyruvate kinase by phosphoenol-pyruvate (PEP) synthetase could be an adaptation to the
highly acidic environment of the stomach in which H. pylori has to survive. More, and also more complex phenotypic 
features of H. pylori can be understood in this way by a pathway analysis utilizing differential genome comparisons 
(Huynen et al, 1998a). What about alternatives for the first part of glycolysis? This is illustrated in Figure 6 for Myco- 
plasma species which are non-glycolytics: Phosphofructokinase is missing (as probably is the case in H. pylori where 
only some homologue to pfkB from E. coli is present but is likely to be utilized differently) and also aldolase is 
absent, for instance in Mycoplasma hominis. These species seem to channel instead glucose by the pentosephosphate 
cycle (Pollack et al, 1997), which also yields glyceraldehyde-3-phosphate plus ribose and NADPH for nucleotide syn- 
thesis and is thus less dispensable than parts of glycolysis in these very compact genomes (Himmelreich et al, 1997). 
Our own investigations indicate that this should only be stoichiometrically possible if there are additional enzymes in
the genome or additional functions of known enzymes which serve to replenish the pool of sugar phosphates in the 
pentose phosphate pathway. There are several further alternatives for converting glucose to pyruvate. The Entner- 
Doudoroff pathway (bottom left), is used instead of glycolysis in some bacteria (Danson & Hough, 1992). Further- 
more, genome analysis by us and others shows that this route is present as a backup pathway for instance in all 
Gram-negative genomes analysed to date. The ATP yield is only one mole per mole glucose. Probably it survived as 
an exclusive pathway in some genomes due to its simplicity and direct yield of NADPH. Top left shows the non- 
phosphorylated Entner-Doudoroff-pathway. This is an example of paleo-metabolism and due to the direct conversion 
of glucose to gluconic acid not yet optimized to obtain any net ATP yield per mole glucose (Melendez-Hevia et al, 
1997). It is present in some Archaea such as Thermoplasma acidophilum (Fields, 1987). 
activities may then be overlooked in homology
searches; Koonin et al., 1996).

Databases increasingly facilitate the prediction of
metabolic pathways, notably EcoCyc (Encyclopedia
of E. coli Genes and Metabolism), HinCyc
(Encyclopedia of H. influenzae Genes and Metabolism),
PUMA (see below), Biocatalysis/Biodegradation
Databases, Enzyme Database
(e.g. www.expasy.ch/sprot/enzyme.html), Ligand
Databases (e.g. at the Japanese genome net, its
compound section is a collection of metabolic
compounds including substrates, products, and
inhibitors; www.genome.ad.jp/dbget/ligand.html),
Klotho, the biochemical compounds declarative
database; KEGG (Kyoto Encyclopedia of Genes
and Genomes page) and pathway pages on the
Web such as the Boehringer Mannheim pathways
chart, NetBiochem Welcome Page or pages on particular
organisms such as for soybean metabolism
(cgsc.biology.yale.edu/metab.html). Experimentally
verified pathway databases have been collected
for regulatory circuits such as cell cycle in
yeast, human and budding yeast (BRITE project,
www.genomes.ad.jp/brite/Cellcyclemaps.html), as
cross-references (object oriented database management
system ACEDB) between protein kinases,
their interactions, 3D structure and pathways by
Igarashi & Kaminuma (1997) and for fly genes
involved in pattern formation (Jacq et al., 1997).

Several software tools reconstruct metabolic
pathways, usually in association with databases
(see above). Early efforts (Seressiotis & Bailey,
1988; Mavrovouniotis et al., 1990) required extensive
pre-analysis of the genome and the proteins
encoded therein. More recent developments
include Magpie (Multipurpose Automated Genome
Project Investigation Environment), an automated
genome analysis tool (Gaasterland & Sensen, 1996)
that accesses several databases through an object
and attribute viewer. Reaction equations and compounds
are taken from the Enzyme and Metabolic
Pathway Database (Selkov et al., 1996) and have
been assigned via homology to proteins from several
organisms. The precomputed reconstruction
can be accessed via the Web. The WIT (What Is
There) system (Overbeek et al., 1997) is similar in
concept, but offers a wide range of query options.
It is a useful toolkit (http://www.cme.msu.edu/WIT)
to briefly check for pathways that might be
present in the genome of interest. Pathway
computations are also possible with the KEGG
database, for instance testing the completeness of an
enzyme list (e.g. from a genome sequencing project)
with regard to a certain pathway (Ogata et al.,
1998).
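
Testing the completeness of an enzyme list against a pathway amounts to
listing the pathway steps that have no match. The sketch below is a
stand-alone illustration and does not use the KEGG software or data format.
The ten-step glycolysis list is the textbook pathway written with enzyme
names, and the example enzyme list is an assumption constructed to mirror the
two gaps reported for H. pylori in the caption of Figure 6; it is not an
actual genome annotation.

```python
from typing import Iterable, List

# Textbook glycolysis, one enzyme name per step, in pathway order.
GLYCOLYSIS = [
    "hexokinase",
    "glucose-6-phosphate isomerase",
    "phosphofructokinase",
    "aldolase",
    "triosephosphate isomerase",
    "glyceraldehyde-3-phosphate dehydrogenase",
    "phosphoglycerate kinase",
    "phosphoglycerate mutase",
    "enolase",
    "pyruvate kinase",
]


def missing_steps(pathway: List[str], enzymes_found: Iterable[str]) -> List[str]:
    """Pathway steps, in order, for which no enzyme was found in the genome."""
    found = set(enzymes_found)
    return [step for step in pathway if step not in found]


# Hypothetical enzyme list lacking the two steps named in Figure 6.
example_enzymes = [
    step for step in GLYCOLYSIS
    if step not in ("phosphofructokinase", "pyruvate kinase")
] + ["PEP synthetase"]
gaps = missing_steps(GLYCOLYSIS, example_enzymes)
```

A gap in such a list is only a starting point. As the text notes, a missing
step may be filled by a non-orthologous enzyme, so each gap has to be checked
against alternative activities before the pathway is declared absent.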

Nevertheless, current reconstruction of metabolic
pathways from sequence is mostly done manually
using various tools that guide the decisions with
consideration of accumulated biochemical and biological
knowledge. For example the EcoCyc WWW
Server (Karp et al., 1998) is used as a reference and
each possible hit there is carefully checked for
orthology (the whole protein function should be
similar, the sequence similarity should not be
restricted only to a functional domain; otherwise
no complete function transfer is possible). For the latter
an efficient tool is the COGs server (clusters of
orthologous genes; Tatusov et al., 1997;
http://www.ncbi.nlm.nih.gov/COG/). Profile alignments
of important enzymatic activities such as signatures
for pathways are also used and are being
developed in several other laboratories (e.g.
Rawlings & Searls, 1997).

The prediction of interdependencies of genes 
and metabolism 

Utilizing the tools and approaches above 
together with methods for comparative sequence 
and genome analysis, a number of specific predic- 
tions in recently sequenced prokaryotic organisms 
have been made that go beyond the analysis pro- 
vided in the publications on sequenced genomes 
for prokaryotes (Selkov et al., 1996, 1997; Tatusov
et al., 1996; Koonin et al., 1996; Strauss & Falkow,
1997) and eukaryotes (Oliver, 1997; Palsson, 1997).
The plasticity and the enzyme variety even of very 
basic pathways turns out to be surprisingly high. 
Figure 6 illustrates this for variants from standard 
glycolysis encountered after genome analysis. 

Predictions of protein functions and enzyme
pathways cover only the repertoire of functions
present. Metabolic control analysis, however, also
considers quantitative aspects such as flux, flow,
concentrations, stoichiometric and allosteric effects,
compartmentalization and regulation (see e.g.
Schuster, 1996; Thomas & Fell, 1996; Bish &
Mavrovouniotis, 1998 and references therein).
Knowledge of possible metabolite flows (i.e. different
paths and orders of reactions given a constant
number of enzymes; "elementary modes",
Schuster & Hilgetag, 1994, 1995; Liao et al., 1996;
Nuno et al., 1997; Bonarius et al., 1997) should
improve the understanding of the context of identified
enzymes in the near future. This requires
well-studied systems, but exactly these can
be achieved by extensive genome and proteome
analysis.

Comparative analysis of complete genomes pro- 
vides further tools to study gene interdependence. 
For example, genes that depend on each other are 
expected to occur together in genomes or to be 
absent altogether. By doing large scale comparative 
genome analysis such correlations between genes 
become apparent and provide an extra tool for 
finding connections in metabolism or signalling 
cascades. One example is the sets of genes shared by
M. genitalium and exactly one of M. jannaschii (set 1)
or M. thermoautotrophicum (set 2), but not by the
other. Set 1 encodes, among others, the functionally
related proteins phosphoglucose isomerase,
glyceraldehyde 3-phosphate dehydrogenase and pyruvate
kinase, which are all involved in glycolysis,
whereas set 2 contains the genes for DnaK and
DnaJ, parts of a chaperone pathway (Huynen &
Bork, 1998).
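
The set comparison described in this paragraph amounts to simple set algebra on gene-family presence lists. The sketch below uses toy gene sets that only mirror the example in the text; the family label "famX" and the exact gene content of each genome are hypothetical.

```python
def differential_sets(reference, genome_a, genome_b):
    """Gene families the reference genome shares with exactly one of two genomes."""
    set1 = (reference & genome_a) - genome_b  # shared with genome A only
    set2 = (reference & genome_b) - genome_a  # shared with genome B only
    return set1, set2


# Toy presence lists (illustrative only; "famX" stands for a family found in all three).
m_genitalium = {"pgi", "gap", "pyk", "dnaK", "dnaJ", "famX"}
m_jannaschii = {"pgi", "gap", "pyk", "famX"}
m_thermoautotrophicum = {"dnaK", "dnaJ", "famX"}

set1, set2 = differential_sets(m_genitalium, m_jannaschii, m_thermoautotrophicum)
print("set 1:", sorted(set1))
print("set 2:", sorted(set2))
```

Genes that fall into the same differential set are candidates for functional coupling, because they tend to be present or absent together.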

Robustness, modularity and interdependence

When considering all the levels discussed, there
seems to be a discrepancy between the complex
nature of the networks of genes and their
interdependence (e.g. via regulation) on the one hand and
their surprising robustness (e.g. to horizontal gene
transfer or gene loss) on the other. One way in
which such robustness might be achieved is a
highly modular organisation; the interdependencies
of genes would then be limited to small sets. As yet
we do not have a quantitative understanding of the
modularity of cellular organisation, including the
genome, and its implications for the flexibility and
robustness of evolution. One also needs to keep in
mind that the examples of robustness we see are a
selected set: evolution does not report negative
results. We have tried to show here the power of
using information contained in entire genomes, i.e.
context information and the interdependencies of
genes within a genome. These rules affect the
function prediction process in various ways.

Limited prediction accuracy at all levels 
and interdependence 

Although many methods exist for various
aspects of each prediction step, one has to bear in
mind that they are not perfect and have only
limited accuracy. In addition, most of the methods
have (sometimes hidden) parameters that influence
the search result drastically (just switch the matrix
in BLAST from the default BLOSUM62 to PAM250
and watch the changes in the output). Fortunately,
the loss of information in each step is compensated
by the fact that data are produced by experimental
methods at all the different levels (Figure 2). Thus,
the errors do not add up and can be compensated
by information from different levels, e.g. by using
genome information to improve the prediction of
protein function as described here. Experimental
validation of hypotheses can also be conducted at
all levels; the interdependence even allows the
interpretation of cellular data for molecular
features and vice versa.

Modularity at each level and robustness 

Modularity is already present at the DNA
sequence level in repeats, ubiquitous promoters,
duplicated segments, etc. The limited set of
domains, used again and again as structural and
functional scaffolds, documents modularity at the
gene and protein level. Displacement of
non-homologous but functionally equivalent enzymes and
the distinct pathway variants that all lead to the
same compounds (see Figure 6) are evidence for
modularity at the cellular level. Complex systems
such as the cytoskeleton (animals versus
Mycoplasmas) or even specialized organs (vertebrate versus
octopus eye) do not represent unique solutions and
reveal that even tissues can be re-invented on the
basis of lower level modules. Thus, a remarkable
robustness can be observed at all levels, the
balance of which might seem surprising given the
shuffling, horizontal transfer, disruption, insertion
etc. of genetic material. On the other hand, this
robustness also offers hope that functional features
are more strongly implied by, and more predictable
from, sequence than previously expected.

Prediction of function from sequence is a con- 
siderably more complex enterprise than a simple 
sequence database search which represented the 
entire repertoire of tools a few years ago. In par- 
ticular, with the arrival of multiple entirely 
sequenced genomes and experimental input at var- 
ious complexity levels we have the chance to 
approach a new quality of understanding of cellular
processes and their evolution.

A genomic DNA segment encoding an extracellular laccase was isolated from the thermophilic fungus
Myceliophthora thermophila, and the nucleotide sequence of this gene was determined. The deduced amino acid
sequence of M. thermophila laccase (MtL) shows homology to laccases from diverse fungal genera. A vector
containing the M. thermophila laccase coding region, under transcriptional control of an Aspergillus oryzae
α-amylase gene promoter and terminator, was constructed for heterologous expression in A. oryzae. The
recombinant laccase expressed in A. oryzae was purified to electrophoretic homogeneity by anion-exchange
chromatography. Amino-terminal sequence data suggest that MtL is synthesized as a preproenzyme. The
molecular mass was estimated to be approximately 100 to 140 kDa by gel filtration on Sephacryl S-300 and to
be 85 kDa by sodium dodecyl sulfate-polyacrylamide gel electrophoresis. Carbohydrate analysis revealed that
MtL contains 40 to 60% glycosylation. The laccase shows an absorbance spectrum that is typical of blue copper
oxidases, with maxima at 276 and 589 nm, and contains 3.9 copper atoms per subunit. With syringaldazine as
a substrate, MtL has optimal activity at pH 6.5 and retains nearly 100% of its activity when incubated at 60°C
for 20 min. This is the first report of the cloning and heterologous expression of a thermostable laccase.



Laccases (EC 1.10.3.2) are multicopper enzymes that catalyze
the oxidation of a variety of phenolic compounds, with
concomitant reduction of O2 to H2O. These polyphenol oxidases
are widely distributed among plant (9, 10, 41, 51) and
fungal (8, 14) species; however, their biological significance is
unclear. Among the filamentous fungi, approximately 30 laccases
have been identified in various organisms. These laccases
may be involved in conidial pigmentation (4, 22), lignin
degradation (2, 17, 34, 39, 56, 57, 63), pathogenicity (5), and
formation of fruiting bodies (38). Interest in laccases has been
fueled by their potential uses in detoxification of environmental
pollutants (13, 15, 16, 26, 46, 54, 61), prevention of wine
decoloration (37), paper processing (50), enzymatic conversion
of chemical intermediates (1), and production of useful
chemicals from lignin (57).

For any of these potential applications to become a reality, 
an inexpensive source of laccase must be obtained. In most 
fungi, laccases are produced at levels that are too low for 
commercial purposes. Cloning of the laccase genes followed by 
heterologous expression may provide higher enzyme yields. A
number of genes encoding fungal laccases have been cloned,
including those from basidiomycetes such as Trametes (Coriolus)
versicolor (32, 33), Trametes villosa (75), Coriolus hirsutus (35),
Rhizoctonia solani (69), Agaricus bisporus (49), Phlebia radiata
(56), and basidiomycete PM1 (23), and from the ascomycetes
Cryphonectria parasitica (19), Aspergillus nidulans (4),
Podospora anserina (28), and Neurospora crassa (29). Collectively,
the amino acid sequences deduced from these genes suggest that
the overall structure of fungal laccases is similar to that of
ascorbate oxidase from zucchini (43).

The laccase genes from C. hirsutus and P. radiata have been
expressed in Saccharomyces cerevisiae (35) and Trichoderma
reesei (55), respectively. The yeast GAL10 promoter was used
to direct the expression of enzymatically active C. hirsutus
laccase, with yields of approximately 5 mg per liter (74). The
P. radiata laccase was secreted at a level of about 20 mg per
liter by using the promoter and terminator regions of the
T. reesei cbh1 gene
(55). Clearly, higher enzyme titers are required for commercial 
enzyme production. Several Aspergillus species, including As- 
pergillus oryzae, are well-established as good expression systems 
for the heterologous production of industrial enzymes (7, 20, 
24). We hypothesized that A. oryzae might also be well suited 
as a host for laccase expression and secretion. 

It is well documented that thermophilic fungi may comprise 
a rich source of thermostable industrial enzymes (53). Further- 
more, thermal tolerance is an attractive feature for many 
biotechnological applications of enzymes. The thermophilic
fungus Myceliophthora thermophila (teleomorph = Thielavia
heterothallica) was described previously as a producer of cellulase
and xylanase enzymes with pronounced thermal resistance (48, 
53, 58-60, 76). M. thermophila was first described by Apinis (3) 
and given the name Sporotrichum thermophile. Its taxonomic 
position was reassigned to the genus Chrysosporium (68) and 
later to its current genus (65). Our objectives were to deter- 
mine if M. thermophila produced a thermostable extracellular 
laccase, clone the gene encoding it, express this gene in A. 
oryzae, and biochemically characterize the resulting enzyme. 

MATERIALS AND METHODS 

Fungal strains and plasmids. Genomic DNA was isolated from M. thermophila
CBS 117.65. Escherichia coli JM101 (45) was used for construction and
routine propagation of laccase expression vectors. The fungal host for laccase
expression was a uridine-requiring (pyrG) mutant of the α-amylase-deficient A.
oryzae strain HowB104.

The vector pMWR3 was constructed by inserting the A. oryzae α-amylase
promoter and terminator elements from pTAKA17 (11, 21) into pUC18 (72). In
this vector, there are a SwaI site at the end of the promoter and an NsiI site at
the beginning of the terminator for directional cloning. The cloning vehicle
pUC519 (6a) was derived by inserting a small linker containing NsiI, ClaI, XhoI,
and BglII restriction sites between the adjacent BamHI and XbaI sites of pUC119
(66).

Materials. The chemicals, buffers, and substrates used were commercial products
of at least reagent grade. Endo/N-glycosidase F and pyroglutamate aminopeptidase
were purchased from Boehringer Mannheim (Indianapolis, Ind.).
Chromatography was done by using a fast protein liquid chromatography system
(Pharmacia LKB, Uppsala, Sweden) or a conventional open low-pressure system.
Spectroscopic assays were conducted with either a UV160 UV-visible light spec- 
trophotometer (Shimadzu, Inc., Columbia, Md.) or a microplate reader (Molec- 
ular Devices, Menlo Park, Calif.). 

DNA extraction and hybridization analysis. Total cellular DNA was extracted
from M. thermophila cells by the procedure described by Timberlake and Barnard
(64). Genomic DNA samples were analyzed by Southern hybridization (25)
under conditions of mild stringency (i.e., 5× SSPE [1× SSPE is 0.18 M NaCl, 10
mM NaH2PO4, and 1 mM EDTA; pH 7.7], 35% formamide, 0.3% sodium
dodecyl sulfate [SDS]). The laccase-specific probe fragment (approximately 1.5
kb) comprised the 5' portion of the N. crassa lcc-1 gene. The purified probe
fragment was radiolabeled by nick translation (42) with [α-32P]dCTP (Amersham)
and added to the hybridization buffer at an activity of approximately 10^6
cpm per ml. The mixture was incubated overnight at 45°C. Following incubation,
the membrane filters were washed once in 0.2× SSPE with 0.1% SDS at 45°C
and then twice in 0.2× SSPE (no SDS) at the same temperature. The filters were
dried on paper towels for 15 min and then were wrapped in Saran Wrap and exposed
to X-ray film overnight at -70°C with intensifying screens.
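
As a rough aid to reading the stringency conditions, the component concentrations of the hybridization buffer (5× SSPE) and the wash buffer (0.2× SSPE) follow directly from the 1× composition stated above. The sketch below only does this scaling; the function and variable names are illustrative.

```python
# Composition of 1x SSPE as stated in the text (molar concentrations).
SSPE_1X = {"NaCl": 0.18, "NaH2PO4": 0.010, "EDTA": 0.001}


def sspe_concentrations(fold):
    """Return molar concentrations of each SSPE component at `fold`x strength."""
    return {name: conc * fold for name, conc in SSPE_1X.items()}


hybridization = sspe_concentrations(5)   # 5x SSPE used for hybridization
wash = sspe_concentrations(0.2)          # 0.2x SSPE used for the washes

for label, buffer in (("5x", hybridization), ("0.2x", wash)):
    parts = ", ".join(f"{name} {conc * 1000:.1f} mM" for name, conc in buffer.items())
    print(f"{label} SSPE: {parts}")
```

The 25-fold drop in salt between hybridization and washing is what raises the stringency of the wash step relative to the hybridization step.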

DNA libraries and identification of laccase clones. A genomic DNA library
was constructed in λ-EMBL4 (62). Briefly, DNA was partially digested with
Sau3AI and size fractionated on low-melting-point agarose gels. DNA fragments
migrating between 9 and 23 kb were excised and eluted from the gel by using
β-agarase (New England Biolabs, Beverly, Mass.). The eluted DNA fragments
were ligated with BamHI-cleaved and dephosphorylated λ-EMBL4 vector arms,
and the ligation mixtures were packaged by using commercial packaging extracts
(Stratagene, La Jolla, Calif.). The packaged DNA library was plated and amplified
on E. coli K802 cells (42). Approximately 20,000 plaques were screened by
plaque hybridization (25) with the radiolabeled N. crassa laccase gene fragment
under the conditions described above. Plaques which gave hybridization signals
with the probe were purified twice on E. coli K802 cells, and DNA from three of
these phage was purified by using a Qiagen Lambda kit (Qiagen, Inc., Chatsworth,
Calif.).

Analysis of laccase genes. Restriction mapping was completed by standard 
methods (40). DNA sequencing was done with a model 373A automated DNA 
sequencer (Applied Biosystems, Inc., Foster City, Calif.) by using the primer 
walking technique with dye-terminator chemistry (30). The final nucleotide se- 
quence was determined on both strands. Oligonucleotides were synthesized on 
an Applied Biosystems model 394 DNA/RNA synthesizer. 

Construction of laccase expression vectors. Construction of the laccase expression
vector pRaMB5 is outlined in Fig. 1. The promoter directing transcription
of the laccase gene segment was obtained from the A. oryzae α-amylase
(TAKA-amylase) gene (21). The α-amylase polyadenylation/transcription terminator
region from pTAKA17 was also used in construction of this vector (21).

Cotransformation of A. oryzae. Methods for cotransformation of A. oryzae
were described by Christensen et al. (21). Equal amounts (approximately 5 µg
each) of laccase vector and one of the following plasmids were used: ppyrG
(Fungal Genetics Stock Center, Kansas City, Kans.), which contains the A.
nidulans pyrG gene (47), or pSO2, which harbors the A. oryzae pyrG gene.
Prototrophic (Pyr+) transformants were selected on Aspergillus minimal medium
(52), and the transformants were screened for the ability to produce laccase on
minimal medium containing 1 mM 2,2'-azinobis(3-ethylbenzthiazoline-6-sulfonic
acid) (ABTS). Cells that secreted active laccase oxidized the ABTS, producing a
green halo surrounding the colony.

Analysis of laccase-producing transformants. Transformants that produced 
laccase activity on agar plates were purified twice through conidiospores, and 
spore suspensions in sterile 0.01% Tween 80 were made from each. The density 
of spores in each suspension was estimated spectrophotometrically by absorption 
at 595 nm. Approximately 0.5 absorbance units of spores was used to inoculate 
25 ml of shake flask medium in 125-ml plastic flasks. The shake flask medium
contained the following (per liter): 1 g of CaCl2 · 2H2O, 2 g of yeast extract, 1 g
of MgSO4, 2 g of citric acid, 5 g of KH2PO4, 1 g of urea, 2 g of (NH4)2SO4, 20 g
of maltodextrin, and 0.5 ml of trace elements solution (36). The cultures were
incubated at 37°C with vigorous aeration (approximately 200 rpm) for 4 to 5 days.
Culture broths were harvested by centrifugation, and the amount of laccase 
activity in the supernatant was determined. Transformants producing the highest 
levels of the recombinant M. thermophila laccase (r-MtL) in shake flask cultures 
were also grown in laboratory fermentors. 

Laccase assays. The syringaldazine oxidase activity of r-MtL was determined
by using 19 µM syringaldazine and monitoring the absorbance change at 530 nm
(extinction coefficient = 65 mM⁻¹ cm⁻¹ [6]). One syringaldazine oxidation unit
(SOU) was defined as the amount of enzyme that oxidizes 1 µmol of substrate
per min in 1 ml at 20°C. ABTS oxidation assays were done by using 1 mM ABTS
and monitoring the absorbance change at 418 or 405 nm (extinction coefficient =
36 or 35 mM⁻¹ cm⁻¹, respectively [18]). Britton and Robinson (B&R) buffers,
made by mixing 0.1 M boric acid-0.1 M acetic acid-0.1 M phosphoric acid with
0.5 M NaOH to the desired pH, were used to determine the pH activity profile
of r-MtL. Thermostability analysis of r-MtL was performed by using 0.8 to 1.2
µM samples preincubated in B&R buffer (pH 6) at various temperatures. The
samples were assayed for syringaldazine oxidase activity after a 430-fold dilution
at room temperature.
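
The conversion from an absorbance slope to activity follows from the Beer-Lambert law and the extinction coefficient given above. The sketch below assumes a 1-cm light path and uses a hypothetical slope of 0.13 absorbance units per min; neither value is taken from the paper, and the function name is illustrative.

```python
EXT_SYRINGALDAZINE = 65.0  # mM^-1 cm^-1 at 530 nm (value given in the text)


def activity_units_per_ml(delta_a_per_min, extinction_mM, path_cm=1.0, dilution=1.0):
    """Convert an absorbance slope into umol of substrate oxidized per min per ml.

    Beer-Lambert: A = e * c * l, so dc/dt (mM/min) = (dA/dt) / (e * l).
    Because 1 mM equals 1 umol/ml, the result is already in umol/min/ml;
    multiplying by the dilution factor refers it back to the undiluted sample.
    """
    if extinction_mM <= 0 or path_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    return delta_a_per_min / (extinction_mM * path_cm) * dilution


# Hypothetical reading: 0.13 A530/min measured on a sample diluted 430-fold.
in_cuvette = activity_units_per_ml(0.13, EXT_SYRINGALDAZINE)
undiluted = activity_units_per_ml(0.13, EXT_SYRINGALDAZINE, dilution=430)
print(f"{in_cuvette:.4f} SOU/ml in the assay, {undiluted:.2f} SOU/ml before dilution")
```

The same function would apply to the ABTS assay by passing 36 or 35 mM⁻¹ cm⁻¹ as the extinction coefficient.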

Purification of native MtL from M. thermophila culture broth. M. thermophila
was grown for 5 days at 42°C in medium which contained the following: 1%
glucose, 4% dextrin, 2% ammonium citrate, 0.1% MgSO4 · 7H2O, 1% KH2PO4,
0.1% CaCl2, 0.01% FeSO4 · 7H2O, 0.05% CuSO4 · 5H2O, 0.01% pluronic
antifoam, and 0.5% PWH salts [containing (per liter) 0.3 g of ZnCl2, 0.6 g of
FeSO4 · 7H2O, 0.25 g of CuSO4, 0.35 g of MnSO4 · H2O, 0.2 g of (NH4)6Mo7O24 ·
4H2O, and 6 g of tetrasodium-EDTA]. The mycelia were removed by filtration,
and the culture broth was washed, filtered, adjusted to pH 7, and applied to a
Q-Sepharose (Hiload 26/10; Pharmacia) column that was preequilibrated with
0.1 M phosphate buffer, pH 7. The laccase concentration of the crude broth was
approximately 5 mg per liter. The laccase activity eluted with a gradient of 0 to
1 M NaCl. An 11-fold purification and recovery yield of 56% were achieved.

Purification of r-MtL from A. oryzae culture broth. A washed, concentrated 
broth sample (pH 7.6; conductivity = 0.8 mS) was loaded onto a Q-Sepharose
XK26 column (120 ml; Pharmacia) preequilibrated with 10 mM Tris, pH 7.5. 
MtL has an intense blue color (corresponding to an absorbance peak at 600 nm) 
that is typical of multicopper oxidases (44). One group of blue fractions was 
collected after the column was loaded and washed. A second group eluted with 
a linear gradient of 0 to 2 M NaCl. SDS-polyacrylamide gel electrophoresis 
(SDS-PAGE) analysis showed that this preparation was essentially pure laccase. 
A purification of 121-fold and recovery of 67% were achieved. The purified 
r-MtL showed no activity loss over a 5-week-long storage frozen in Q-Sepharose 
elution buffer at -20°C. 

Analyses of amino acid composition, carbohydrate content, N-terminal and
C-terminal sequences, copper content, and native molecular mass. N-terminal
sequencing was done with an Applied Biosystems model 476A protein sequencer.
The sequencing reagents were from Perkin-Elmer/Applied Biosystems
Division (Foster City, Calif.). A 1090L high-pressure liquid chromatography
system (Hewlett-Packard Co., Wilmington, Del.) equipped with diode array
detection at 215 and 280 nm and 3D Chemstation software was used for the
separation of CNBr- and protease-generated enzyme fragments. Separations
were done on a Vydac C4 or C18 reverse-phase column (Vydac, Hesperia, Calif.).
C-terminal sequencing was done by J. M. Bailey of Hewlett-Packard Co. Total
amino acid analysis, from which the extinction coefficient of r-MtL was determined,
was done with a Hewlett-Packard 1090 AminoQuant instrument.

Hydrolyses of protein-bound carbohydrate for monosaccharide compositional
analysis were done in duplicate. Lyophilized samples were hydrolyzed in evacuated
sealed glass tubes with 100 µl of 2 M trifluoroacetic acid (TFA) for 1 and
4 h at 100°C. Monosaccharides were separated by high-performance anion-exchange
chromatography using a CarboPac PA1 column (Dionex Corporation,
Sunnyvale, Calif.), eluted with 16 mM NaOH, and detected by pulsed amperometric
detection. Due to the different stability and release of the monosaccharides
in 2 M TFA, the amounts of glucosamine and mannose were determined
after 4 h of hydrolysis, whereas the amount of galactose was determined after 1 h
of hydrolysis. Deglycosylation was also achieved by using endo/N-glycosidase F
(Boehringer Mannheim) according to the manufacturer's instructions, and the
carbohydrate content of r-MtL was estimated from the mobility difference in
SDS-PAGE. Enzymatic removal of the N-terminal pyroglutamate residue was
done with pyroglutamate aminopeptidase (Boehringer Mannheim) in accordance
with the manufacturer's instructions. About 80 µg of r-MtL was treated
with 4 µg of peptidase with or without 1 M urea or 0.1 M guanidine HCl and then
transferred to a polyvinylidene difluoride membrane for sequencing. About 20
pmol of peptidase-treated protein was obtained and sequenced.

SDS-PAGE and native isoelectric focusing (IEF) analysis were done on com- 
mercial apparatus (Novex, San Diego, Calif., and Bio-Rad Laboratories, Her- 
cules, Calif.). Proteins were stained with Coomassie brilliant blue. Gel filtration 
analyses were done on a Sephacryl S-300 (Pharmacia) column, and the native 
molecular mass was estimated by using blue dextran (2,000 kDa), bovine immu- 
noglobulin G (158 kDa), bovine serum albumin (66 kDa), ovalbumin (45 kDa), 
and horse heart myoglobin (17 kDa) to calibrate the column. 

The copper content was determined by the photometric titration method of 
Felsenfeld (27) and by atomic absorption spectroscopy. 

The extinction coefficient for r-MtL was calculated on the basis of amino acid 
analysis, and the molecular mass was deduced from the DNA sequence. 

Nucleotide sequence accession number. The nucleotide sequence of the lcc1
coding region was determined and deposited in the GenSeq database under
accession no. T10922.



RESULTS 

Cloning and characterization of sequence of the laccase
gene from M. thermophila. Genomic DNA was prepared from
N. crassa and M. thermophila, digested with BamHI, fractionated
by agarose gel electrophoresis, blotted, and probed under
conditions of mild stringency with a radiolabeled fragment
encoding a portion of the N. crassa laccase gene. A single
laccase-specific DNA fragment was detected in both genomic
digests. We then screened approximately 20,000 plaques from
an M. thermophila genomic DNA library. Eight plaques that
hybridized strongly to the probe were identified. DNA was
isolated from three of these plaques, cleaved with EcoRI, and
analyzed by agarose gel electrophoresis and Southern
hybridization. All three clones contained a 7.5-kb EcoRI fragment
which hybridized to the laccase-specific probe. One fragment
was subcloned into pBR322 (12) to generate plasmid
pRaMB1. The entire M. thermophila laccase gene (lcc1) coding
region was contained within a 3.2-kb NheI-BglII segment that
was subcloned into pUC119 (66) to give plasmid pRaMB2. The
nucleotide sequence of this segment was determined on both
strands by the primer walking method (30).

FIG. 1. Scheme for construction of the vector pRaMB5 for Myceliophthora laccase expression in Aspergillus. First, plasmid pMWR3 was modified by inserting a small
linker that contained an ApaI site between the SwaI and NsiI sites, creating a plasmid called pMWR3-SAN. Second, Pfu polymerase-directed PCR (Stratagene) was
used to amplify a short DNA segment encoding the 5' portion of MtL, from the start codon to an internal PstI site (approximately 0.5 kb). The forward primer for this
PCR was designed to create an EcoRI site just upstream of the start codon (Fig. 2). Next, the amplified fragment was digested with EcoRI and PstI and subcloned into
an M13mp18 sequencing vector, and its nucleotide sequence was verified. This fragment was subsequently excised by cleavage with EcoRI and PstI (during this step,
the EcoRI site was made blunt by treatment with deoxynucleoside triphosphates and DNA polymerase I [Klenow fragment]) and purified by agarose gel electrophoresis.
The 3' portion of the lcc1 coding region was excised from pRaMB2 as a 2-kb PstI-ApaI fragment (this segment also contains approximately 110 bp from the 3'
untranslated region). Lastly, these two fragments were combined with SwaI- and ApaI-cleaved pMWR3-SAN in a three-part ligation reaction to generate the laccase
expression vector pRaMB5.

The positions of six introns (85, 84, 102, 72, 147, and 95
nucleotides in length) within the lcc1 coding region were
determined by comparing the deduced amino acid sequence of
MtL to that of N. crassa laccase and by applying the consensus
rules for intron features in filamentous fungi (31). Additionally,
the amino acid sequences of several internal peptide fragments
from recombinant MtL were determined, and the correct
reading frame for the lcc1 gene as well as the positions of
the second, third, and sixth introns was verified. The 1,860
nucleotides of coding sequence are 65.5% G+C, with a strong
bias (90%) for codons ending in G or C.

                    SwaI-EcoRI junction†             Met Lys Ser
. . . GAACAATAAACCCCACAGAAGGCATTTAATTCGGCTTCCACC ATG AAG TCC CTGCA
. . . CTTGTTATTTGGGGTGTCTTCCGTAAATTAAGCCGAAGGTGG TAC TTC AGG G

Left: A. oryzae TAKA-amylase promoter (1.1 kb). Right: EcoRI-PstI fragment generated by PCR with Pfu polymerase†.

FIG. 2. Scheme for joining an A. oryzae α-amylase promoter to the MtL coding region in the expression vector pRaMB5. †, EcoRI site that was made blunt by
treatment with DNA polymerase I (Klenow fragment) and deoxynucleoside triphosphates.



The deduced amino acid sequence of MtL shares identity
with laccases of the following species: Podospora anserina
(65%), N. crassa (60%), C. parasitica (5?>%), Agaricus bisporus
(24%), P. radiata (22%), Pleurotus ostreatus (22%), R. solani
(20%), T. versicolor (22%), T. villosa (22%), and A. nidulans
(15%). Similarity is highest in the regions that correspond to
the four histidines and one cysteine that form the trinuclear
copper cluster (23, 44, 49). There are 11 potential sites
(Asn-X-Ser/Thr) for N-linked glycosylation in the deduced amino
acid sequence of MtL.
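
Potential N-linked glycosylation sites of the Asn-X-Ser/Thr type can be located with a simple pattern search. The sketch below is a generic illustration run on a made-up toy peptide, not on the MtL sequence; it takes X to be any residue, as written above, although excluding proline at X is a commonly used refinement.

```python
import re

# Asn, any residue, then Ser or Thr; the lookahead lets overlapping sites be reported.
SEQUON = re.compile(r"(?=N[A-Z][ST])")


def find_sequons(protein):
    """Return 1-based positions of Asn residues in Asn-X-Ser/Thr motifs."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


toy_peptide = "MNASQLNVTKNNSSA"  # hypothetical sequence for illustration only
print(find_sequons(toy_peptide))
```

Such a search only identifies potential sites; which of them actually carry carbohydrate has to be determined experimentally.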

The first 22 amino acids in the deduced primary structure of 
MtL appear to comprise a canonical signal peptide with a 
predicted cleavage following an Ala residue (67). The purified 
extracellular forms of both native MtL and r-MtL are blocked 
with N-terminal pyroglutamate residues. Enzymatic removal of 
these residues followed by amino acid sequencing suggests that 
mature MtL begins with a Gln residue. Thus, MtL is apparently
synthesized as a 619-residue preproenzyme having a
22-residue signal peptide and a propeptide of 25 residues.
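
The residue counts quoted for the precursor can be cross-checked with simple arithmetic. The sketch below assumes that the 1,860 coding nucleotides include the stop codon and that the signal peptide and propeptide are removed consecutively from the N terminus; both assumptions are ours, not statements from the paper.

```python
CODING_NT = 1860        # coding nucleotides reported for the laccase gene
PREPRO_RESIDUES = 619   # reported length of the preproenzyme
SIGNAL_RESIDUES = 22    # reported signal peptide length
PRO_RESIDUES = 25       # reported propeptide length

codons = CODING_NT // 3                      # 620 codons if the length is a multiple of 3
residues_if_stop_included = codons - 1       # assumption: last codon is the stop codon

removed = SIGNAL_RESIDUES + PRO_RESIDUES     # residues cleaved from the precursor
mature_start = removed + 1                   # position of the first mature residue
mature_length = PREPRO_RESIDUES - removed    # residues left in the mature enzyme

print(f"{codons} codons, {residues_if_stop_included} residues plus stop")
print(f"mature enzyme starts at residue {mature_start}, length {mature_length}")
```

Under these assumptions the numbers are mutually consistent, and a mature N terminus at residue 48 agrees with the 22-residue signal peptide plus 25-residue propeptide.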

Expression of Myceliophthora laccase. The expression vector
pRaMB5 (Fig. 1 and 2) was used to generate A. oryzae
cotransformants producing r-MtL, which were detected by
incorporating ABTS into the selective medium. The frequencies
of laccase-producing cotransformants among Pyr+ colonies were
59% with A. nidulans pyrG as the selectable marker and 31%
with A. oryzae pyrG as the marker.
Several cotransformants that produced intense color reactions 
on ABTS plates were grown in shake flask cultures to quanti- 
tate the amount of r-MtL produced. The amount of extracel- 
lular laccase activity produced ranged from 0.49 to 0.85 
SOU/ml (Table 1). On the basis of the specific activity of 45 
SOU/mg (see below), the level of r-MtL secreted in these 
shake flask cultures ranged from 11 to 19 mg per liter. Prelim- 
inary SDS-PAGE analyses of culture broth samples showed a 
prominent laccase band at approximately 85 kDa, which is 
similar to the size of the native enzyme purified from M. ther- 
mophila. 
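
The conversion from activity to protein concentration used above is a single division by the specific activity. A minimal sketch, using the shake flask values from Table 1 and the specific activity of 45 SOU/mg (the function name is illustrative):

```python
SPECIFIC_ACTIVITY = 45.0  # SOU per mg of r-MtL, as reported in the text

# Shake flask activities (SOU/ml) of the laccase-producing transformants in Table 1.
shake_flask_sou_per_ml = [0.85, 0.71, 0.60, 0.68, 0.70, 0.49, 0.54]


def mg_per_liter(sou_per_ml, sou_per_mg=SPECIFIC_ACTIVITY):
    """Convert volumetric activity (SOU/ml) into enzyme concentration (mg/liter)."""
    return sou_per_ml / sou_per_mg * 1000.0


yields = [mg_per_liter(value) for value in shake_flask_sou_per_ml]
print(f"lowest: {min(yields):.0f} mg/liter, highest: {max(yields):.0f} mg/liter")
```

The lowest and highest values reproduce the stated range of 11 to 19 mg per liter.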



TABLE 1. MtL expression among selected A. oryzae transformants^a

Transformant    Transforming DNAs^b    SOU/ml in shake flask
Control         None                   0.00
RaMB5.15        pRaMB5 + ppyrG         0.85
RaMB5.30        pRaMB5 + ppyrG         0.71
RaMB5.33        pRaMB5 + ppyrG         0.60
RaMB5.108       pRaMB5 + pSO2          0.68
RaMB5.111       pRaMB5 + pSO2          0.70
RaMB5.121       pRaMB5 + pSO2          0.49
RaMB5.142       pRaMB5 + pSO2          0.54



Biochemical characterization of r-MtL produced in A.
oryzae. Q-Sepharose chromatography yielded two active fractions
that contained essentially pure laccase (one passed
through the column, and another was eluted by a 0 to 2 M
NaCl gradient). Purified r-MtL has a molecular mass of 100 to
140 kDa as determined by S-300 gel filtration (data not shown)
and a molecular mass of 75 to 95 kDa as determined by SDS-PAGE
(Fig. 3). Under nondenaturing conditions, both r-MtL
fractions had a pI of 4.2. Treatment of the purified enzyme
with N-glycosidase resulted in a decrease in the apparent
molecular mass to approximately 73 kDa (Fig. 3). The increased
mobility on SDS-PAGE after deglycosylation suggested that
N-linked carbohydrates accounted for approximately 14% of
the total mass of each subunit. Total-carbohydrate analysis
showed that the laccase fractions that passed through the
Q-Sepharose column (preequilibrated with 10 mM Tris, pH 7.5)
contained 26 mol of glucosamine, 67 mol of galactose, 9 mol of
glucose, and 138 mol of mannose per mol of enzyme. The
laccase fractions that eluted from the Q-Sepharose with NaCl
had 23 mol of glucosamine, 38 mol of galactose, 4 mol of
glucose and 85 mol of mannose per mol of enzyme.
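
The estimate of N-linked carbohydrate content from the SDS-PAGE mobility shift is a simple mass difference. The sketch below assumes 85 kDa for the glycosylated subunit (the midpoint of the 75 to 95 kDa range and the value quoted elsewhere in the paper) and 73 kDa after deglycosylation; the function name is illustrative.

```python
def carbohydrate_fraction(glycosylated_kda, deglycosylated_kda):
    """Fraction of subunit mass lost on deglycosylation, from apparent SDS-PAGE masses."""
    if glycosylated_kda <= 0:
        raise ValueError("glycosylated mass must be positive")
    return (glycosylated_kda - deglycosylated_kda) / glycosylated_kda


fraction = carbohydrate_fraction(85.0, 73.0)
print(f"N-linked carbohydrate: about {fraction:.0%} of subunit mass")
```

Apparent masses of glycoproteins on SDS-PAGE are approximate, so the result is best read as a rough estimate.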

Attempts to directly sequence the N terminus of r-MtL from 
samples either in desalted solution or on polyvinylidene diflu- 
oride membranes were unsuccessful. Treatment of r-MtL with 
pyroglutamate aminopeptidase yielded a protein with a de- 
blocked N terminus, beginning 48 residues after the putative 
translation start (Met). Sequencing of internal peptides gener- 
ated by CNBr cleavage confirmed the DNA sequence and 
several intron and exon assignments. Direct C-terminal se- 
quencing indicated that r-MtL had a C terminus of -Gly-Leu. 

The UV-visible absorbance spectrum of r-MtL shows ab- 
sorption maxima at 276 and 589 nm. The ratio of the absor- 
bance at 280 nm to the absorbance at 600 nm was 35. This is 
higher than reported for T. villosa (75) and R. solani (69) 
laccases, suggesting that MtL contains more tryptophan, phe- 



kDa 




"A. oryzae HowBltMpyrC was the host strain. 

b Plasm ids ppyrG and pS02 contain the pyrG genes of A. nidulans and A. 
oryzae t respectively. 



FIG. 3. SDS-PAGE profile for the purification of r-MtL. Lane 1, concentrated 
A. oryzae culture broth to be loaded onto Q-Sepharose; lane 2, purified 
r-MtL; lane 3, r-MtL treated with endo-/N-glycosidase F. The gel was stained 
with Coomassie brilliant blue. 

FIG. 4. Dependence of r-MtL activity on pH and temperature. (A) Laccase 
activity as a function of pH (normalized to the optimum activity value) of native 
(nonrecombinant) MtL with ABTS as a substrate (+), r-MtL with ABTS as a 
substrate (×), native MtL with syringaldazine as a substrate (△), and r-MtL with 
syringaldazine as a substrate (○). (B) Thermostability of r-MtL. Enzyme samples 
(0.8 to 1.2 μM) were preincubated in B&R buffer (pH 6) for 0.3 (△), 1.0 (×), and 
17 (○) h at various temperatures, diluted 430-fold, and assayed for residual 
activity in B&R buffer (pH 6) at 20°C with syringaldazine as the substrate. The 
activities were normalized to the initial activity at 20°C before the preincubation. 



nylalanine, and cysteine. This suggestion was confirmed by 
comparing the amino acid compositions deduced from the 
corresponding gene sequences. Photometric titration and 
atomic absorption spectroscopy indicated a stoichiometry of 
3.9 copper atoms per enzyme subunit. 

r-MtL had an optimal pH of 6.5 with syringaldazine as the 
substrate (Fig. 4A). At the optimum pH, the specific activity of 
r-MtL with syringaldazine used as a substrate was 45 SOU per 
mg. With ABTS as a substrate, r-MtL showed maximum activ- 
ity at the lowest pH studied (pH 2.7). The difference in opti- 
mum pH values for ABTS and syringaldazine substrates is 
consistent with the hypothesis that electron transfer kinetics 
are more important than substrate binding in determining the 
pH activity profile of laccase (71). Thermostability analysis 
shown in Fig. 4B indicated that the upper temperature limit for 
retaining full activity after a 20-min preincubation was 60°C. 

Purification and characterization of native MtL from M. 
thermophila. Q-Sepharose chromatography yielded an 11-fold 
purification of laccase from M. thermophila. Purified MtL mi- 
grated as a diffuse band in SDS-PAGE with a molecular mass 
of 80 kDa, and on IEF gels it had an isoelectric point of 4.2. 
MtL showed a UV-visible spectrum with an absorbance maximum 
at 280 nm and two smaller shoulders at 330 and 600 nm. The 
weak absorbance maximum at 600 nm indicated the presence 
of apo-MtL (copper depleted) in the preparation, possibly 
resulting from instability during purification. The pH activity 
profile of native MtL on syringaldazine is also shown in Fig. 
4A. At optimum pH, MtL had an activity of 15 SOU per mg. 
N-terminal sequencing analysis indicated that the amino terminus 
was also blocked. 

DISCUSSION 

We cloned a genomic DNA segment encoding an extracel- 
lular laccase from the thermophilic fungus M. thermophila. On 
the basis of a comparison of its deduced amino acid sequence, 
MtL shows identity with laccases from diverse fungal genera. 
However, the greatest degree of sequence identity (53 to 65%) 
is between MtL and the laccases of related species in the order 
Sphaeriales, such as P. anserina (28), N. crassa (29), and C. 
parasitica (19). Laccases from basidiomycetes show more-lim- 
ited sequence identity (20 to 24%) to MtL. Interestingly, the 
conidial laccase (yA gene product) of A. nidulans (4) shows the 
lowest degree of similarity, suggesting a possible evolutionary 
or functional difference between conidial and secreted lacca- 
ses. 

At the genomic level, the lcc1 gene of M. thermophila also 
has the highest homology with laccase genes from P. anserina, 
N. crassa, and C. parasitica; however, their architecture is very 
different. For example, N. crassa lcc1 contains a single intron, 
lac2 from P. anserina has three introns, lcc1 from M. thermophila 
has six intervening sequences, and the C. parasitica 
laccase gene has 12 introns. The position of the first 
intron is conserved among laccase genes from M. thermophila, 
N. crassa, and P. anserina. Additionally, introns II and III in P. 
anserina lac2 align with the third and fourth introns of M. 
thermophila lcc1. The positions of five intervening sequences in 
M. thermophila lcc1 are conserved in the C. parasitica laccase 
gene. Therefore, we postulate that the laccase genes from 
these four species were derived from a common ancestral form 
but diverged during evolution. The comparatively low level of 
sequence similarity between the laccase genes of basidiomyce- 
tes and ascomycetes probably reflects the large phylogenetic 
distance between these fungal classes. 

The primary structures of the laccase gene products from 
Neurospora, Podospora, and Myceliophthora predict similar 
mechanisms of posttranslational processing. On the basis of 
the rules of von Heijne (67), the predicted signal peptide cleav- 
age site for MtL lies after the first 22 amino acids. However, 
direct sequencing of the amino terminus of native and recom- 
binant forms of MtL suggested that the first residue of the 
mature enzyme is Gln49. Therefore, residues 23 through 48 
probably comprise a propeptide whose proteolytic removal 
occurs during maturation of MtL, leaving Gln49 as the first 
amino acid residue of the mature enzyme, which may subse- 
quently cyclize to pyroglutamate, yielding a blocked N termi- 
nus. It was reported that N. crassa and P. anserina laccases are 
processed similarly at their amino-terminal ends (28, 29). In 
addition, N. crassa laccase is also reportedly processed at its C 
terminus, resulting in the proteolytic removal of 13 residues 
(29). The processing site is contained within the sequence 
Asp-Ser-Gly-Leu↓Arg558 (where ↓ designates the cleavage 
site). Strikingly similar sequences exist near the C termini of 
MtL (Asp-Ser-Gly-Leu-Lys560) and P. anserina laccase 
(Asp-Ser-Gly-Leu-Lys559). C-terminal sequencing showed that the C 
terminus of MtL was Gly-Leu, indicating that the enzyme was 
processed (Asp-Ser-Gly-Leu↓Lys560) similarly to N. crassa 
laccase and 13 residues were removed. C-terminal processing 
of P. anserina laccase was also postulated (28). It is particularly 
interesting that the C. parasitica laccase has Asp-Ser-Gly-Val 
as its C terminus, and thus no processing may be needed. The 
importance of C-terminal processing for the catalytic activity of 
these enzymes is unknown, and the protease involved in this 
cleavage has not been identified. 

By ion-exchange chromatography, r-MtL from A. oryzae fermentor 
broth was separated into multiple isoforms with different 
elution properties. However, no significant difference 
among these isoforms was seen in terms of SDS-PAGE, native 
PAGE, native IEF, S-300 gel filtration, UV-visible spectrum, 
specific activity towards syringaldazine, and unblocked-N-ter- 
minus sequencing measurements. On the basis of total-carbo- 
hydrate analyses, it appears likely that the different elution 
patterns of various r-MtL isoforms from Q-Sepharose arose 
from differential glycosylation with galactose and mannose. 
Total-carbohydrate analyses also gave an estimate of 33 to 
60% total glycosylation, of which 14% is estimated to be N 
linked, on the basis of the mobility change on SDS-PAGE after 
N-glycosidase treatment. The r-MtL differed from native MtL 
in two other respects. First, the molecular mass of native MtL 
(80 kDa) was less than that of r-MtL (85 kDa), presumably 
reflecting differences in glycosylation. Second, the specific ac- 
tivity of native MtL (15 SOU/mg) was lower than that of r-MtL 
(45 SOU/mg). Since native MtL also had a lower absorbance at 
600 nm, the decreased specific activity is probably due to a 
percentage of holoenzyme lower than that of r-MtL. Since the 
type II copper is easily depleted (73), extra copper ions were 
added to the culture medium of A. oryzae transformants expressing 
r-MtL. This appears to have yielded a purified r-MtL 
preparation with a specific activity higher than that of native 
MtL which was isolated from cultures not supplemented with 
additional copper. 

The ascomycete fungus M. thermophila produces a constellation 
of thermostable cellulases (48, 53, 58-60) and at least 
one thermotolerant xylanase (76). Whether this organism 
might be a good source of thermostable laccase was a subject 
of this investigation. The observation that r-MtL retains virtually 
100% activity after 20-min incubation at 60°C seems to 
validate our approach. In addition, Xu et al. (71) disclose that 
MtL not only is more thermostable than laccases from the 
basidiomycetes T. villosa and R. solani, but also demonstrates a 
pronounced thermal activation such that preincubation at el- 
evated temperatures gives higher activity. 

The yield of r-MtL from A. oryzae cotransformants grown in 
shake flasks was modest (11 to 19 mg per liter). However, these 
yields are consistent with those obtained by heterologous ex- 
pression of basidiomycete laccases in other hosts (35, 55). In 
addition, it seems likely that the heterologous expression of 
laccases in A. oryzae will benefit from the successful history of 
industrial scale-up, strain development, and process methods 
for other Aspergillus enzyme products. 

The phyA gene encoding an extracellular phytase from the thermophilic fungus Thermomyces lanuginosus was 
cloned and heterologously expressed, and the recombinant gene product was biochemically characterized. The 
phyA gene encodes a primary translation product (PhyA) of 475 amino acids (aa) which includes a putative 
signal peptide (23 aa) and propeptide (10 aa). The deduced amino acid sequence of PhyA has limited sequence 
identity (ca. 47%) with Aspergillus niger phytase. The phyA gene was inserted into an expression vector under 
transcriptional control of the Fusarium oxysporum trypsin gene promoter and used to transform a Fusarium 
venenatum recipient strain. The secreted recombinant phytase protein was enzymatically active between pHs 3 
and 7.5, with a specific activity of 110 μmol of inorganic phosphate released per min per mg of protein at pH 
6 and 37°C. The Thermomyces phytase retained activity at assay temperatures up to 75°C and demonstrated 
superior catalytic efficiency to any known fungal phytase at 65°C (the temperature optimum). Comparison of 
this new Thermomyces catalyst with the well-known Aspergillus niger phytase reveals other favorable properties 
for the enzyme derived from the thermophilic gene donor, including catalytic activity over an expanded pH 
range. 



Phytases (myoinositol hexakisphosphate phosphohydro- 
lases; EC 3.1.3.8) catalyze the hydrolysis of phytic acid (myo- 
inositol hexakisphosphate) to the mono-, di-, tri-, tetra-, and 
pentaphosphates of myoinositol and inorganic phosphate. A 
broad range of microorganisms, including bacteria (20), yeasts 
(2), and filamentous fungi (10, 19, 27), produce phytases. 

Phytic acid is the primary storage form of phosphate in 
cereal grains, legumes, and oilseeds, such as soy, which are the 
principal components of animal feeds. However, monogastric 
animals are unable to metabolize phytic acid and largely ex- 
crete it in their manure. Therefore, the presence of phytic acid 
in animal feeds for chickens and pigs is undesirable, because 
the phosphate moieties of phytic acid chelate essential miner- 
als and possibly proteins, rendering the nutrients unavailable. 
Since phosphorus is an essential element for the growth of all 
organisms, livestock feed must be supplemented with inorganic 
phosphate. There are a number of published reports (12, 16, 
18, 26) describing the use of phytases in the feeds of mono- 
gastric animals and in human food. 

When phytic acid is not metabolized by monogastric animals 
the phosphate level in the manure can also create disposal 
problems. The amount of manure produced worldwide has 
increased significantly as a result of increased livestock pro- 
duction. Environmental pollution with high-phosphate manure 
has caused problems in various locations around the world due 
to the accumulation of phosphate, particularly in bodies of 
water. Consequently, animal feed distributors in Europe have 
begun to formulate feed products with supplemental phytase in 
order to improve feedlot productivity and decrease phosphate 
waste. Thus, phytases are also useful for reducing the amount 
of phytate in manure (13, 18). The current commercial feed 
supplement is a recombinant Aspergillus niger (previously As- 
pergillus ficuum) phytase produced in Aspergillus niger (27) or 
Aspergillus oryzae (i.e., Phytase Novo [13]). 

There is a definite commercial need for second-generation 
phytases with improved properties (e.g., higher thermostability 
and catalytic efficiency) that can be produced in commercially 
significant quantities. Our objectives were to identify, clone, 
and characterize a phytase from a thermophilic fungus in an- 
ticipation that this enzyme would offer superior biochemical 
properties. 

MATERIALS AND METHODS 

DNA extraction and hybridization analysis. Total cellular DNA was extracted 
from Thermomyces lanuginosus CBS 586.94 by the procedure described by 
Timberlake and Barnard (21). Genomic DNA samples were analyzed by Southern 
hybridization (6) under conditions of low stringency (i.e., 5× SSPE [1× SSPE is 
0.18 M NaCl, 10 mM NaH2PO4, and 1 mM EDTA {pH 7.7}], 25% formamide, 
0.3% sodium dodecyl sulfate [SDS]). A phytase-specific probe fragment 
comprising the Aspergillus niger phyA coding region (approximately 1.6 kb) was 
radiolabeled by nick translation (11) with [α-32P]dCTP (Amersham, Arlington 
Heights, Ill.) and added to the hybridization buffer at an activity of approximately 
10^6 cpm per ml. The hybridization and washing conditions have been described 
previously (4). 

DNA libraries and identification of phytase clones. Genomic DNA libraries 
were constructed with the bacteriophage cloning vector λZipLox (Life 
Technologies, Gaithersburg, Md.) with Escherichia coli Y1090ZL cells (Life 
Technologies) as a host for plating and purification of recombinant bacteriophages and 
E. coli DH10Bzip (Life Technologies) for excision of individual pZL1-phytase 
clones. Total cellular DNA was partially digested with Tsp509I and size 
fractionated on 1% agarose gels. DNA fragments migrating in the range of 3 to 7 kb were 
excised and eluted from the gel with Prep-a-Gene reagents (Bio-Rad 
Laboratories, Hercules, Calif.). The eluted DNA fragments were ligated with 
EcoRI-cleaved and dephosphorylated λZipLox vector arms (Life Technologies), and the 
ligation mixtures were packaged with commercial packaging extracts 
(Stratagene, La Jolla, Calif.). The packaged DNA libraries were plated and amplified 
in E. coli Y1090ZL cells (Life Technologies). Approximately 30,000 plaques 
from the library were screened by plaque hybridization with the radiolabeled 
phytase probe. One positive clone which hybridized strongly to the probe was 
picked and purified twice in E. coli Y1090ZL cells. The phytase clone was 
subsequently excised from the λZipLox vector as a pZL1-phytase clone (5) and 
designated pMWR46. 

Molecular analysis of the T. lanuginosus phytase gene. Restriction mapping of 
pMWR46 was performed by standard methods (11). DNA sequencing of the 
phytase clones was performed with a model 373A automated DNA sequencer 
(Applied Biosystems, Inc., Foster City, Calif.) by the primer-walking technique 
with dye-terminator chemistry (7). In addition to the lac forward and lac reverse 
primers, specific oligonucleotide sequencing primers were synthesized on an 
Applied Biosystems model 394 DNA-RNA synthesizer according to the 
manufacturer's instructions. 

Construction of the phytase expression vector pMWR48. The coding region of 
the T. lanuginosus phyA gene was amplified by PCR with the forward primer 
5'-ATTTAAATGGCGGGGATAGGTTTGG-3' and the reverse primer 
5'-CTTAATTAATCAAAAGCAGCGATCCC-3'. The sense primer incorporated the 
first in-frame ATG and extends 16 bp downstream. The antisense primer 
incorporated a region 14 bp upstream of the translational stop codon and extends 
through the stop codon. To facilitate the cloning of the amplified fragment, the 
sense and antisense primers contain a SwaI and a PacI restriction site, 
respectively. The amplified product was digested with SwaI and PacI and ligated with 
pDM181 (also digested with SwaI and PacI), a plasmid which provides the 
Fusarium oxysporum trypsin gene promoter and terminator and the bar resistance 
cassette (3). The resulting expression vector was designated pMWR48. 
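
The primer description can be checked directly against the two sequences. The sketch below is an illustration, not part of the original methods: the primer strings are the printed sequences with spacing removed, the forward primer is read so that it begins with the SwaI recognition sequence ATTTAAAT, and the stop codon (TGA) is inferred here from the reverse complement of the antisense primer.

```python
SWAI_SITE = "ATTTAAAT"  # SwaI recognition sequence
PACI_SITE = "TTAATTAA"  # PacI recognition sequence

FORWARD = "ATTTAAATGGCGGGGATAGGTTTGG"
REVERSE = "CTTAATTAATCAAAAGCAGCGATCCC"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Sense primer: locate the first ATG and count the bases that follow it.
atg_index = FORWARD.find("ATG")
bases_after_atg = len(FORWARD) - (atg_index + 3)

# Antisense primer: view it on the coding strand and locate the stop codon.
coding_view = reverse_complement(REVERSE)
stop_index = coding_view.find("TGA")  # assumption: TGA is the stop codon
bases_before_stop = stop_index

print("SwaI site in forward primer:", SWAI_SITE in FORWARD)
print("PacI site in reverse primer:", PACI_SITE in REVERSE)
print("bases downstream of ATG:", bases_after_atg)
print("bases upstream of stop codon:", bases_before_stop)
```

Under these readings, the counts agree with the 16 bp and 14 bp figures given in the text, and the SwaI site overlaps the start codon by two bases.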

Transformation of Fusarium venenatum and analysis of transformants. Trans- 
formation protocols and methods for purification of F. venenatum (28) transfor- 
mants are described by Royer et al. (15). Mycelia from primary transformants 
were used to inoculate shake flasks containing 25 ml of M400Da medium (50 g 
of maltodextrin, 2 g of MgSO4·7H2O, 2 g of KH2PO4, 4 g of citric acid, 8 g of 
yeast extract, 2 g of urea, and 0.5 ml of trace metal solution per liter [15]) and 
incubated with shaking at 30°C. One milliliter of culture supernatant was 
harvested at 4, 5, and 7 days and stored at 4°C. Phytase activity was assayed as 
described below. Spores from the primary transformants producing the highest 
phytase activity were generated by inoculating 20 ml of R medium (12.1 g of 
NaNO3/liter, 50 g of succinic acid/liter, 20 ml of 50× Vogel's salts, 25 mM 
NaNO3 [pH 6.0] [15]) with mycelia and incubating it at 30°C with shaking for 2 
to 3 days. Single spores were isolated by spreading 150 ml of spore culture onto 
manipulator plates (1× Vogel's salts, 25 ml of NaNO3, 2.5% sucrose, 2% Noble 
agar) containing 5 mg of Basta [phosphinothricin or 
2-amino-4-(hydroxymethylphosphinyl)butanoic acid; Hoechst-Schering, Rodovre, Denmark] per ml and 
using a micromanipulator to transfer single spores to a clear region of the plate. 
After 3 days of growth at room temperature, the germinated spores were trans- 
ferred to individual Vogel plates containing 5 mg of Basta/ml. Shake flasks 
containing 25 ml of M400Da medium plus 5 mg of Basta/ml were inoculated in 
duplicate with mycelial plugs from each single-spore isolate and incubated at 
30°C. The best single-spore isolate was selected based on assay of the secreted 
enzymatic activity, where the transformants produced > 150-fold more phytase 
activity than an untransformed control. 

Protein purification. The best F. venenatum transformant was run in two 
2-liter fermentors with a standard protocol (3). The frozen cell-free broth (1,700 
ml) was thawed, clarified by centrifugation, and concentrated on a hollow-fiber 
Amicon filtration unit with an S1Y10 filter to a volume of 350 ml. The sample 
was adjusted to pH 7, diluted to a conductivity of 2 mS, and chromatographed at 
room temperature on a 75-ml-bed-volume Q-Sepharose Big Beads column 
(Pharmacia), which had been equilibrated in 20 mM Tris-Cl, pH 7. The column 
was developed at 5 ml/min with the equilibration buffer until the effluent A280 
had decreased to near baseline. The column was then developed at 5 ml/min with 
a 600-ml gradient of 0 to 0.6 M NaCl in the same buffer. The bound enzyme 
activity was found to elute in fractions corresponding to ca. 0.2 M NaCl. 

The collected activity peak was concentrated by ultrafiltration with a PM-10 
membrane to a volume of 25 ml, diluted to a conductivity of 0.9 mS, and 
chromatographed at 4 ml/min on a MonoQ HR 10/16 column which had been 
equilibrated in 20 mM MOPS (morpholinepropanesulfonic acid), pH 7. The 
column was developed with 80 ml of starting buffer and then with a 400-ml 
gradient of 0 to 0.5 M NaCl in the same buffer. Enzyme activity was detected in 
fractions by using the p-nitrophenyl phosphate measurement described below. 
The active fractions were also analyzed with a Novex 10 to 27% gradient 
SDS-polyacrylamide gel, and the fractions were combined if judged by electrophoresis 
to be substantially purified. 

The peak fractions were combined, concentrated with an Amicon PM-10 
membrane by ultrafiltration, and exchanged into 20 mM MES 
(morpholineethanesulfonic acid), pH 5.5. The sample conductivity was 1.1 mS. One-third of 
this sample was chromatographed at 1 ml/min on a Mono S HR 5/5 column 
(Pharmacia) which had been equilibrated in the same buffer. The column was 
developed with 5 ml of starting buffer and then with a 25-ml linear gradient of 0 
to 0.6 M sodium chloride in the same buffer. The active fractions were combined 
after electrophoretic analysis to eliminate those which contained trace contam- 
inants. 

Physicochemical characterization. Isoelectric focusing (IEF) was performed 
with a Novex pH 3 to 7 IEF gel according to the instructions of the manufacturer. 
IEF standards from both Pharmacia and Bio-Rad were used to calibrate the gel. 

The protein extinction coefficient was determined experimentally by quantita- 
tive amino acid analysis with a Hewlett-Packard AminoQuant system. The anal- 
ysis assumed 49,700 for the protein molecular weight, based on the translated 
gene sequence for the mature protein. 

Amino-terminal sequence analysis was performed on an Applied Biosystems 
476A sequencer. 

Enzyme assays. Phytase activity was measured by two different methods. Dur- 
ing purification, fractions were rapidly evaluated by measuring the rate of p- 
nitrophenyl phosphate hydrolysis at 405 nm with 10 mM substrate in 0.2 M 
sodium citrate, pH 5.5, at 30°C with a plate reader (Thermomax; Molecular 
Devices). 

Enzyme kinetics studies performed on purified enzyme samples were 
accomplished by the assay of inorganic phosphate liberated from corn phytic acid 
(Sigma catalog no. P 8810). Exhaustive phytate hydrolysis was accomplished by 
incubating 0.5 or 0.1% phytic acid with enzyme (1 U/ml) in 0.2 M sodium citrate, 
pH 5.5, at 37°C. Aliquots were removed over a period of 10 h and analyzed (see 
below) for kinetics of phosphorus release. Ten hours was found to be sufficient 
for the completion of product formation. Standard enzyme kinetics reactions 
were carried out for 30 min at 37°C in 0.5% (wt/wt) phytic acid. The reaction was 
quenched by the addition of an equal volume of 15% (wt/wt) trichloroacetic acid. 
After cooling, 100 μl of the resulting mixture was diluted in 1 ml of water. The 
sample was incubated at 50°C for 5 min. Color reagent (1 ml) was added, and the 
50°C incubation was continued for 15 min. The absorbance of a 200-μl aliquot 
was measured at 690 nm with a microplate reader. The color reagent was 
composed of 6 N sulfuric acid-water-2.5% (wt/vol) hepta-ammonium 
molybdate-10% ascorbate (aqueous) in a ratio of 1:2:1:1 and was prepared fresh daily. 
Quantitation was based on a standard curve generated with a 10 mM sodium 
monobasic phosphate standard. One unit is defined as 1 μmol of inorganic 
phosphate released per min with 0.5% phytic acid in 0.2 M sodium citrate, pH 
5.5, at 37°C. 
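
The quantitation step amounts to fitting a phosphate standard curve and converting an absorbance reading into units. The following is a minimal sketch under stated assumptions: the standard concentrations and A690 readings are hypothetical, the standards are assumed to be processed exactly like the samples (so the curve reports phosphate concentration in the reaction mixture), and all function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def phosphate_mm(a690, slope, intercept):
    """Convert an A690 reading to phosphate concentration (mM) via the curve."""
    return (a690 - intercept) / slope


def activity_u_per_ml(phosphate_released_mm, minutes=30.0):
    """1 U = 1 umol phosphate per min; 1 mM is 1 umol per ml of reaction."""
    return phosphate_released_mm / minutes


# Hypothetical standards (mM phosphate) and readings, for illustration only.
standards_mm = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
readings = [0.05 + 0.08 * c for c in standards_mm]

slope, intercept = fit_line(standards_mm, readings)
released = phosphate_mm(0.29, slope, intercept)
activity = activity_u_per_ml(released)
print(f"{activity:.3f} U per ml of reaction mixture")
```

Any dilution of the enzyme before the assay would need to be multiplied back in to obtain the activity of the stock solution.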

Steady-state kinetics measurements were made by substrate titration. Phytate 
concentrations were 2.16, 1.08, 0.541, 0.216, 0.108, and 0.0758 mM for Km 
determination. Phytate concentrations of 1.08, 0.541, 0.216, and 0.108 mM in the 
presence or absence of 1 mM sodium monobasic phosphate were used to eval- 
uate product inhibition. 

Thermostability measurement. Phytase samples were dissolved at 100 U per 
ml in 0.2 M sodium citrate, pH 5.5. One-hundred-microliter aliquots of each 
enzyme solution were incubated for 20 min in a water bath at 37, 45, 50, 55, 60, 
65, 70, and 75°C. After the heat treatment, the samples were stored at 0°C until 
activity assays were performed. Each sample was diluted 1:80 in 0.2 M sodium 
citrate, pH 5.5, containing 0.01% (wt/wt) Tween 20, and the standard activity 
assay was performed. 

pH-activity measurement. To attain a buffering range between pHs 2 and 7, a 
three-component 125 mM glycine-acetate-citrate buffer was employed. The 
buffer components were combined at final concentrations of 42 mM per com- 
ponent, and phytic acid was added as a solid to 1% (wt/wt). This mixture was 
adjusted to pH 7 with concentrated HCl, and a 10-ml aliquot was taken. This 
process was repeated for every 0.5 pH units through pH 2. 

Enzyme stock solutions of 20 U per ml were prepared in 20 mM MES buffer, 
pH 5.5. Substrate (1% [wt/wt]; 850 μl) in buffer at a given pH was combined with 
100 μl of water and 50 μl of enzyme stock solution and incubated for 30 min at 
37°C. Subsequently, the enzyme reaction was quenched with 1 ml of 15% 
trichloroacetic acid and quantitated by the standard method. 

Temperature-activity measurement. Enzyme stock solutions of 12.5 U per ml 
were prepared in 0.2 M sodium citrate buffer, pH 5.5. Two hundred fifty 
microliters of 1% phytic acid substrate was added to a 1.7-ml Eppendorf tube followed 
by 240 μl of 0.2 M sodium citrate buffer, pH 5.5. This solution was vortexed and 
placed in a water bath at the designated temperature. After 20 min of 
equilibration in the water bath, the mixture was vortexed and 10 μl of phytase solution 
was added. The sample was vortexed and incubated in the water bath for an 
additional 30 min, and then the reaction was quenched with 1 ml of 15% 
trichloroacetic acid and quantitated by the standard method. 
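
The final composition of the two assay mixtures follows from the stated volumes. The sketch below is an illustrative calculation only; it reads the temperature-assay enzyme stock as 12.5 U per ml, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AssayMix:
    substrate_ul: float      # volume of phytic acid stock
    substrate_pct: float     # concentration of that stock, % (wt/wt)
    diluent_ul: float        # buffer or water added
    enzyme_ul: float         # volume of enzyme stock
    enzyme_u_per_ml: float   # activity of the enzyme stock

    @property
    def total_ul(self) -> float:
        return self.substrate_ul + self.diluent_ul + self.enzyme_ul

    @property
    def final_substrate_pct(self) -> float:
        return self.substrate_pct * self.substrate_ul / self.total_ul

    @property
    def enzyme_units(self) -> float:
        return self.enzyme_u_per_ml * self.enzyme_ul / 1000.0


temperature_assay = AssayMix(250, 1.0, 240, 10, 12.5)
ph_assay = AssayMix(850, 1.0, 100, 50, 20)

for name, mix in (("temperature", temperature_assay), ("pH", ph_assay)):
    print(name, mix.total_ul, mix.final_substrate_pct, mix.enzyme_units)
```

On these numbers the temperature assay runs at the same 0.5% substrate concentration as the standard assay, whereas the pH assay runs at 0.85% substrate with 1 U of enzyme per reaction.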

Nucleotide sequence accession number. The complete phyA gene sequence has 
been deposited in GENESEQN as accession no. T90070. 



RESULTS 

Cloning of phytase gene sequences from T. lanuginosus. 
Southern blotting experiments indicated that an Aspergillus 
phytase gene fragment could be used as a probe to identify 
phytase gene-specific fragments in T. lanuginosus genomic 
DNA (Fig. 1). We screened 30,000 plaques from a genomic 
library of T. lanuginosus DNA constructed in λZipLox for 
hybridization with the Aspergillus phytase gene probe. Several 
positive clones were picked and excised by an in vivo-excision 
protocol (5). 

Analysis of the T. lanuginosus phyA gene. DNA sequencing 
of one T. lanuginosus phytase clone (pMWR46) showed an 
open reading frame similar to the A. niger phytase gene. The 
positions of introns and exons within the phyA gene were 
assigned based on comparison of the deduced amino acid 
sequence with the deduced amino acid sequence of the corre- 
sponding A. niger phytase gene product. On the basis of this 
analysis, the T. lanuginosus phytase gene is comprised of two 
exons (47 and 1,377 bp), which are separated by a small intron 
(56 bp). The size and composition of the intron is consistent 
with those of other fungal genes (9) in that all contain consen- 

FIG. 1. Autoradiogram from Southern hybridization analysis of T. lanuginosus 
genomic DNA with an Aspergillus phytase gene probe. Lanes 1 and 2, A. niger 
genomic DNA digested with BamHI and BamHI plus PstI, respectively; lanes 3 
and 4, Myceliophthora thermophila genomic DNA digested with BamHI and 
BamHI plus PstI, respectively; lanes 5 and 6, Thielavia terrestris genomic DNA 
cleaved with BamHI and BamHI plus PstI, respectively; lanes 7 and 8, T. lanuginosus 
genomic DNA cut with BamHI and BamHI plus PstI, respectively. 

FIG. 3. Phytase thermal stability. Comparison of residual enzyme activity 
after a 20-min incubation at various temperatures. Full activity corresponds to 10 
U. Solid bar, A. niger phytase; cross-hatched bar, T. lanuginosus phytase. 



sus splice donor and acceptor sequences as well as a near 
approximation of the consensus lariat sequence (RCTRAC) 
near the 3' end of each intervening sequence. 

The deduced amino acid sequence of the T. lanuginosus 
gene product shows the characteristics of an extracellular fun- 
gal enzyme with a cleavable signal sequence. Based on the 
rules of von Heijne (25), the first 22 amino acids of PhyA likely 
comprise a secretory signal peptide which directs the nascent 
polypeptide into the endoplasmic reticulum. Amino-terminal 
amino acid sequencing suggests that the next 10 amino acids 
constitute a propeptide which terminates with a dibasic cleav- 
age site (LysLys). The mature PhyA is an acidic protein (pre- 
dicted isoelectric point, 5.4) composed of 452 amino acids 
(molecular mass, 51 kDa). The amino acid sequence also con- 
tains the active-site motif RHGXRXP, which is shared by 
other known phytases and acid phosphatases (Fig. 2) (23, 27). 
Lastly, the deduced amino acid sequence of the mature PhyA 
has approximately 47.5% identity with the phytase from A. 
niger (GenBank accession no. M94550). 
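
A motif such as RHGXRXP, where X is any residue, can be located in a protein sequence with a regular expression. The sketch below is illustrative only: the example peptide is hypothetical and is not the PhyA sequence, and the function name is an assumption.

```python
import re

# RHGXRXP with X as any single residue (one-letter amino acid code).
ACTIVE_SITE = re.compile(r"RHG.R.P")


def find_motif(protein: str):
    """Return (1-based start position, matched text) for each motif hit."""
    return [(m.start() + 1, m.group()) for m in ACTIVE_SITE.finditer(protein.upper())]


# Hypothetical peptide used only to demonstrate the search.
example = "MKLSAVRHGARYPTDSKA"
hits = find_motif(example)
print(hits)
```

Note that `finditer` reports non-overlapping matches only, which is sufficient for a seven-residue motif that is expected once per sequence.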

Analysis of F. venenatum transformants expressing T. 
lanuginosus phytase. F. venenatum has recently been developed 
as an efficient fungal host for the production of heterologous 
proteins (15). Culture supernatants from 14 of the 17 primary 
transformants of pMWR48 were positive when assayed for 
phytase activity. Two primary transformants with the highest 
phytase activity were selected for single-spore isolation, and 
nine single-spore isolates were obtained. 

Physicochemical characterization of the recombinant 
phytase. The purified T. lanuginosus phytase was apparently 

FIG. 2. Alignment of putative active-site regions of acid phosphatases (AP) 
and phytases from various species. The M. thermophila (Myceliophth; TrEMBL 
O00107), Talaromyces thermophilus (Talaromyc; TrEMBL O00096), A. fumigatus 
(TrEMBL O00092), A. ficuum (A. niger) (SwissProt P34752 and P34754), 
Saccharomyces cerevisiae (YScAP3 and -5; SwissProt P24031 and P00635), 
human (HuPAP and HuLAP; SwissProt P15309 and P11117), and E. coli 
(SwissProt P07102) sequences were obtained from the databases indicated. The 
numbers in parentheses are the starting amino acid positions from the mature 
proteins for the sequences compared. Identical amino acids are boxed. 
Thermomyc, T. lanuginosus. 



homogeneous in SDS-polyacrylamide gel electrophoresis, with 
a single component corresponding to a molecular weight of 
60,000. The protein sample contained numerous components 
in IEF analysis ranging from pH 4.7 to 5.2. In contrast to the 
T. lanuginosus phytase, recombinant A. niger phytase is composed 
of a single major component with a pI near 4.9 and two 
minor bands around pI 4.7. 

Amino-terminal sequence analysis of the purified T. lanuginosus 
enzyme identified three components: the major component 
(ca. 60%) is H2N-His-Pro-Asn-Val-Asp-Ile-Ala-Arg-His-Trp-Gly-Gln. . ., 
which corresponds to a Kex2 cleavage site at 
position 34 in the primary translation product. Two minor 
sequences, H2N-Gly-Glu-Asp-Glu-Pro-Phe-Val-Arg-Val-Leu-Val-Asn. . . 
(ca. 30%) and H2N-Ser-Glu-Glu-Glu-Glu-Glu-Gly-Glu-Asp-Glu-Pro-Phe. . . 
(ca. 10%), correspond to internal 
cleavage sites near the COOH terminal of the protein at po- 
sitions 428 and 435 in the primary translation product. The 
observation that our protein sequence data exactly match the 
predicted translation product of the T. lanuginosus gene and 
the finding that untransformed Fusarium host strains produce 
2 orders of magnitude less enzyme activity both argue strongly 
that we have isolated a heterologous gene product. 

The specific activities for the two recombinant phytases (i.e., 
those of T. lanuginosus and A. niger) were 91 and 180 U/mg, 
respectively, under standard assay conditions at pH 5.5. At its 
pH 6 optimum T. lanuginosus phytase had a specific activity of 
110 μmol of inorganic phosphate released per min per mg of 
protein at 37°C. Exhaustive enzymatic hydrolysis of phytic acid 
revealed that A. niger and T. lanuginosus phytases released 
identical amounts (70%) of the total theoretically available 
phosphorus. Steady-state kinetic measurements disclosed that 
the apparent K m of T, lanuginosus phytase is approximately 110 
IxM with respect to phytate while A. niger has an apparent K m 
of 200 jxM. There was a faint indication of excess substrate 
inhibition at the 2.16 mM substrate concentration, perhaps 
congruent with the report of inhibition above 2 mM for A. niger 
phytase (22). Steady-state kinetics measurements with 1 mM 
phosphate present failed to reveal any type of inhibition with 
this product. We estimate that the K t for phosphate must 
exceed 3 mM to be undetectable in our experiments. In con- 
trast Ullah (22) has reported that phosphate is a competitive 
inhibitor, with a K, of 1.9 mM. 
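
The kinetic comparison above can be made concrete with a minimal sketch that evaluates the Michaelis-Menten fractional saturation, v/Vmax = [S]/(Km + [S]), for the two apparent Km values reported (110 and 200 µM). The sketch assumes simple Michaelis-Menten behavior without substrate or product inhibition; the function name and the substrate concentrations are illustrative choices, not values from the study.

```python
def fractional_velocity(substrate_um: float, km_um: float) -> float:
    """Return v/Vmax = [S] / (Km + [S]) for simple Michaelis-Menten kinetics."""
    if substrate_um < 0 or km_um <= 0:
        raise ValueError("substrate must be >= 0 and Km must be > 0")
    return substrate_um / (km_um + substrate_um)


# Apparent Km values (µM phytate) as reported in the text.
APPARENT_KM_UM = {"T. lanuginosus": 110.0, "A. niger": 200.0}

# Substrate concentrations (µM) chosen only for illustration.
for substrate_um in (50.0, 110.0, 200.0, 1000.0):
    row = ", ".join(
        f"{name}: {fractional_velocity(substrate_um, km):.2f}"
        for name, km in APPARENT_KM_UM.items()
    )
    print(f"[S] = {substrate_um:6.0f} µM -> v/Vmax  {row}")
```

These fractions are relative to each enzyme's own Vmax, so they show how quickly each enzyme approaches saturation and do not compare absolute rates between the two enzymes.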

A comparison of enzyme thermostability profiles (Fig. 3) 
suggests that differences between the stabilities of the two 
enzymes are small. Neither enzyme is fully inactivated by a 
high-temperature incubation, and the residual activity profiles 
are consistent with partially reversible thermal denaturation 

FIG. 4. Phytase pH-activity profile; comparison of relative enzyme activity at
various incubation pHs. A relative activity of 100% corresponds to 1 and 1.21
µmol of inorganic phosphate released per min for A. niger and T. lanuginosus
phytases, respectively. Solid squares, A. niger phytase; open circles, T. lanuginosus
phytase.



(24). Differential scanning calorimetry (DSC) experiments re- 
veal that the A. niger enzyme has a transition at 60°C while T. 
lanuginosus phytase unfolds at 69°C. Others have reported an 
Aspergillus fumigatus phytase which has an apparently greater 
propensity for reversible thermal denaturation (14), as measured
by residual enzyme activity. However, there are no published
data on thermal denaturation points for the A. fumigatus
phytase or other phytase species. 

The pH-activity profile comparison of T. lanuginosus and A. 
niger phytases indicates substantial similarity between the pH 
profiles of the two enzymes (Fig. 4). However, the T. lanuginosus
enzyme is active at neutral pH while the A. niger enzyme
is not. We could not reproduce the earlier reports (e.g., refer- 
ence 17) that A. niger phytase possesses two pH optima; em- 
ploying a composite buffer, we measured a broad shoulder 
near pH 3. We note that there are very few cases of a single 
enzyme species possessing two pH optima. The earlier reports 
may originate from impure material which contains traces of 
the A. niger acid phosphatase (29), or they could be artifacts of 
employing more than one buffer to span the pH range. 

Measurement of enzyme activity as a function of temperature
revealed a significant difference between the two enzymes
(Fig. 5). T. lanuginosus phytase has maximum enzyme activity
near 65°C and has partial activity even at 75°C. In contrast, A. 
niger phytase is essentially inactive at 65°C. These results are 
congruent with the DSC data for the two enzymes, which also 
indicate a 9°C stability improvement for the Thermomyces
phytase.

FIG. 5. Phytase temperature-activity measurement; observed enzyme activity
as a function of incubation temperature. A relative activity of 100% corresponds
to 0.125 µmol of inorganic phosphate released per min. Solid squares, A. niger
phytase; open circles, T. lanuginosus phytase.

DISCUSSION 

Enzyme activity at elevated temperatures may be relevant in 
applications such as saccharification (a high-temperature in- 
dustrial process to generate high-fructose corn syrup), where 
others have reported that the addition of phytase improves 
carbohydrate yields (1). Figure 5 demonstrates that at 55°C, 
the optimal temperature for A. niger phytase, the Thermomyces 
phytase performs at 79% of the A. niger phytase turnover 
number (despite lower specific activity for Thermomyces 
phytase at 37°C) and at 60°C the Thermomyces phytase is 
operating at 67%-greater catalytic efficiency than the A. niger 
enzyme. The A. niger phytase is inactivated at 65°C, where 
Thermomyces phytase activity is maximal. 

Enzyme thermal stability is also relevant in animal feed 
applications, where the enzyme is normally incorporated into 
the grains prior to pelletization and the feed briefly reaches 
processing temperatures of 85 to 90°C. In this circumstance a 
commercial phytase product must be able to withstand brief 
heating prior to encountering an animal's digestive tract at 
37°C. Our physicochemical data demonstrate an improvement 
of approximately 9°C in denaturation temperature for Ther- 
momyces phytase versus the present A. niger product. 

Animal-feeding trials with formulated phytase supplemen- 
tation would involve testing a total of 300 broilers or piglets at 
two enzyme dosages plus a control without enzyme addition. 
Typically the apparent total-tract digestibility of dissolved mat- 
ter, organic matter, nitrogen, calcium, and total phosphorus 
would be monitored at one or two points during an animal's 
growth to determine the effect of enzyme dosage on feed 
intake and conversion. Such animal-feeding trials and the level 
of analysis required to present and evaluate the data are be- 
yond the scope of this paper. 

It is tempting to speculate about the structural origins of 
thermal stability in phytases. However, there is no obvious 
pattern to the sequence differences between phytases from 
thermophiles (represented by Myceliophthora, Talaromyces, 
and Thermomyces) and mesophiles (represented by A. niger 
and A. fumigatus). For example, there are no gross differences
in protein structure, such as addition or deletion of secondary 
structure elements. Nor is there a systematic pattern to the 
sequence differences between the two representative enzymes; 
i.e., hydrophobic replacements, addition of salt bridges, addi- 
tion of potential disulfide bonding sites, and deletion of aspar- 
agine or aspartate residues are not readily apparent. The most 
striking difference is the additional consensus N-linked glyco- 
sylation site present in the two Aspergillus enzymes (sequence 
position 231 in reference 27) but missing in the three thermo- 
phile examples. We believe that the most likely explanation 
which can be deduced for the sequence differences is derived 
from evolutionary rather than functional factors. 

Recently the discovery of new industrial enzymes has fo- 
cused on novel microbial sources representing extreme condi- 
tions (extremophiles). In many cases the genes encoding these 
interesting enzymes can be cloned without prior isolation of 
the catalyst or culturing of the donor microbe. However, het- 
erologous production of the novel enzyme often results in 
extremely low yields of secreted product or accumulation of 
inactive material as inclusion bodies. Either of these outcomes 
is incompatible with the production economics required for 
commercialization. We have searched for new industrial cata- 
lysts from a constellation of thermophilic fungi that are more 
closely related than the extremophiles to the industrial fungal 
production strains which are available. We have successfully
isolated enzymes with both improved thermal stability charac- 
teristics and the potential for high-level commercial produc- 
tion (4). 

T. lanuginosus phytase is an alternative enzyme with perfor- 
mance advantages over the conventional A. niger enzyme in the 
form of stable enzyme activity at elevated temperatures and 
superior substrate saturation kinetics at physiological pH. A
second-generation commercial enzyme may also benefit from 
protein engineering when a three-dimensional protein struc- 
ture is available, as is the case for the A. fumigatus enzyme (8). 
Nucleotide sequence analysis shows that Trichoderma
harzianum and Penicillium purpurogenum α1,3-glucanases
(mutanases) have homologous primary structures
(53% amino acid sequence identity), and are composed
of two distinct domains: an NH2-terminal catalytic
domain and a putative COOH-terminal polysaccharide-binding
domain separated by an O-glycosylated Pro-Ser-Thr-rich
linker peptide. Each mutanase was expressed
in an Aspergillus oryzae host under the transcriptional control
of a strong α-amylase gene promoter. The purified
recombinant mutanases show a pH optimum in the
range from pH 3.5 to 4.5 and a temperature optimum
around 50-55 °C at pH 5.5. Also, they exhibit strong binding
to insoluble mutan with KD around 0.11 and 0.13 µM
at pH 7 for the P. purpurogenum and T. harzianum
mutanases, respectively. Partial hydrolysis showed that
the COOH-terminal domain of the T. harzianum mutanase
binds to mutan. The catalytic domains and the
binding domains were assigned to a new family of glycoside
hydrolases and to a new family of carbohydrate-binding
domains, respectively.



Extracellular polysaccharides produced by microbial flora in 
the human oral cavity are believed to play an important role in 
the adherence and proliferation of bacterial aggregates on the 
surface of teeth (1). Consequently, these polysaccharides might
have significance in the development of tartar, plaque, and
possibly dental caries (2). Mutan is a major component of
exopolysaccharides produced by tooth-colonizing streptococci
such as Streptococcus mutans (3). Mutan is composed of α1,3-glucan
with some α1,6-glucan (dextran) side chains. Mutanase
(α1,3-glucanase, EC 3.2.1.59) and dextranase (α1,6-glucanase,
EC 3.2.1.11) enzymes could be beneficial additives to dentifrice
preparations as it has been shown that these enzymes are
capable of removing biofilms created by oral bacteria in vitro (4)
and reducing plaque formation in vivo (5,6). Mutanase activity
from the filamentous fungus Trichoderma harzianum was first
described by Guggenheim and Haller (7). However, only a
limited number of reports are available on the characterization 
of fungal mutanases. Here we describe the cloning, expression, 
and subsequent characterization of two fungal mutanases rep- 
resenting a new family of fungal endoglucanases with their 
unique mutan-binding domains. 

EXPERIMENTAL PROCEDURES 

Fungal Strains — T. harzianum strain CBS 243.71 and Penicillium
purpurogenum CBS 238.95 were used as the sources of genomic DNA.
Aspergillus oryzae JaL142 and JaL125 (obtained from J. Lehmbeck, Novo
Nordisk A/S) were alkaline protease-deficient strains used for heterologous
expression of cloned mutanases. The A. oryzae strains show no
detectable level of background mutanase activity in the assay described.

Purification and Characterization of the Wild-type Mutanase from T.
harzianum — 100 g of SP234 (Novo Nordisk A/S, batch number PPM 3897)
were dissolved in 1 liter of 10 mM sodium acetate, pH 5.2. Contaminant
proteins were removed by batch adsorption on DEAE-Sephadex, then by
batch adsorption on S-Sepharose (Amersham Pharmacia Biotech). After
concentration on a Filtron concentrator equipped with a 10-kDa cut-off
membrane, the unbound material was applied to an S-Sepharose (Amersham
Pharmacia Biotech) column (180 ml, 2.6 x 33 cm) equilibrated with
10 mM sodium acetate, pH 4.7. The mutanase was eluted with a 0-20 mM
linear gradient of NaCl in the same buffer (3 column volumes). The
residual protein was eluted with the same buffer containing 1 M NaCl.
Fractions with high mutanase activity were pooled and concentrated.
After the procedure was repeated 12 times, the pooled fractions were
concentrated and placed in 10 mM Tris-HCl, pH 8.0. The mutanase was
further purified on a HiLoad Q-Sepharose column (50 ml, 2.6 x 10 cm)
equilibrated with 10 mM Tris-HCl, pH 8.0, and eluted with a linear
gradient from 0 to 50 mM NaCl in 12 column volumes. Fractions with high
mutanase activity were pooled and concentrated in an Amicon cell
equipped with a 10-kDa cut-off membrane. Finally, the mutanase preparation
was dialyzed extensively against 10 mM sodium phosphate, pH 7.0.
SDS-PAGE 1 gave one single band at 75 kDa (data not shown). 

Carbohydrate composition analysis was performed on lyophilized 
samples which were hydrolyzed in vacuo in sealed glass tubes using 100 
µl of 2 M trifluoroacetic acid for 1 h and 4 h at 100 °C. Monosaccharides
were separated by high performance anion exchange chromatography 
using a Dionex Carbopac PA1 column eluted with 16 mM NaOH and 
detected by pulsed amperometric detection. 

The mutanase mass was measured using matrix-assisted laser desorption
ionization time-of-flight mass spectrometry (MALDI-MS) (VG
Analytical). Typically 2 µl of sample were mixed with 2 µl of saturated
matrix solution (α-cyano-4-hydroxycinnamic acid in 0.1% trifluoroacetic
acid:acetonitrile (70:30)) and 2 µl of the mixture were deposited on the
target plate. After evaporation of the solvent, the samples were introduced
into the spectrometer. They were desorbed and ionized by 4-ns
laser pulses (337 nm) and subjected to an accelerating voltage of 25 kV.
Ions were detected by a microchannel plate set at 1850 V.

Generation of a cDNA Probe for the T. harzianum Mutanase Using
Reverse Transcriptase PCR — T. harzianum was cultivated as described
(8). A 2-liter sample was taken after 4 days of growth at 30 °C, and the



1 The abbreviations used are: PAGE, polyacrylamide gel electrophoresis;
MALDI-MS, matrix-assisted laser desorption ionization-mass
spectrometry; PCR, polymerase chain reaction; bp, base pair(s); CAPS,
3-(cyclohexylamino)propanesulfonic acid; nt, nucleotide(s); MU, mutanase
unit; ORF, open reading frame.
mycelium was collected, frozen in liquid N2, and stored at -80 °C.
First-strand cDNA was synthesized from 5 µg of T. harzianum poly(A)+
RNA as described earlier (9). A 387-bp fragment of the T. harzianum
mutanase cDNA (10) was amplified using two mutanase-specific primers
(100 pmol each): forward (5'-ACTAAGCTTTATGTTCAAAATGAGCA-3')
and reverse (5'-ACACTCTAGAACATATGGGTTGAAGTTGT-3'),
a DNA thermal cycler (Landgraf, Germany) and 2.5 units of
Taq polymerase (Perkin-Elmer Cetus). Initially, two cycles of PCR were
done using a cycle profile of denaturation at 94 °C for 1 min, annealing
at 45 °C for 2 min, and extension at 72 °C for 3 min, then the annealing
temperature was increased to 55 °C and 30 additional cycles were
performed. The PCR fragment of interest was subcloned into the pUC18
vector and sequenced as described previously (9).

Construction and Screening of the T. harzianum cDNA Library — 
Total RNA was prepared from frozen, powdered mycelium of T. harzianum
by extraction with guanidinium thiocyanate followed by ultracentrifugation
through a 5.7 M CsCl cushion (11). The poly(A)+ RNA was
isolated by oligo(dT)-cellulose affinity chromatography (12). Double-stranded
cDNA was synthesized from 5 µg of T. harzianum poly(A)+
RNA as described earlier (13), except that 25 ng of random hexanucle- 



ment from the mutanase cDNA clone containing amino acids 12 to 34 of 
the mutanase gene. The resulting plasmid pJW99 contains the α-amylase
promoter immediately upstream from the first 34 amino acids of
the mutanase gene, followed by the A. niger glucoamylase terminator.
To complete the expression vector the mutanase cDNA fragment was
cleaved with XhoI and SphI giving a 1790-nt fragment encoding amino
acids 35-598. This fragment was ligated with pJW99 that had been
linearized with XhoI plus XbaI and linker number 2, yielding the vector
pMT1802, which contains the entire mutanase coding region under the
transcriptional control of the A. oryzae α-amylase promoter and A. niger
glucoamylase terminator. Plasmid pMT1796 is identical to pMT1802
except that Glu-35 of the mutanase protein has been changed to Lys-35
by replacing the XhoI/KpnI fragment of pMT1802 with a PCR-amplified
fragment containing this mutation. This PCR fragment was created in
a two-step procedure as reported in Ref. 23 using the following primers:
Primer 1 (nt 2761, 5'-CAGCGTCCACATCACGAGC, nt 2779) and
Primer 2 (nt 3306, 5'-CAAGAAGCACGTTTCTCAGAGACCG, nt 3281);
Primer 3 (nt 3281, 5'-CGGTCTCTGAGAAACGTGCTTCTTC, nt 3306)
and Primer 4 (nt 4276, 5'-GCCACTTCCGTTATTAGCC, nt 4257); nucleotide
numbers refer to the pMT1802 plasmid.
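
Primers 2 and 3 are listed over the same plasmid coordinates in opposite orientations, which is what one would expect of the overlapping internal primers in a two-step PCR mutagenesis. The following is a minimal sketch of a consistency check under that assumption: it computes the reverse complement of Primer 3 and compares it with Primer 2. The helper names are hypothetical, and the sequences are taken exactly as printed in this copy.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def mismatch_positions(a: str, b: str) -> list[int]:
    """Return 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Sequences as printed in the text (5' to 3').
PRIMER_2 = "CAAGAAGCACGTTTCTCAGAGACCG"
PRIMER_3 = "CGGTCTCTGAGAAACGTGCTTCTTC"

expected_primer_2 = reverse_complement(PRIMER_3)
diffs = mismatch_positions(PRIMER_2, expected_primer_2)

print("Primer 2 as printed:      ", PRIMER_2)
print("Reverse comp. of Primer 3:", expected_primer_2)
print("Mismatch positions (0-based):", diffs)
```

Any mismatch reported by such a check may reflect a transcription error in this copy and not the primer design itself, so it is a prompt to consult the original publication, not a correction.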



GATCCTCACA ATG TTG GGC GTT GTC CGC CGT CTA GGC CTA GG 

GAGTGT TAC AAC CCG CAA CAG GCT GCA GAT CCG GAT CCG C 

Met Leu Gly Val Val Arg Arg Leu Gly Leu Gly 

Linker 1 

C CAA TAC TGT TAG T 

GT ACG GTT ATG ACA ATC AGATC 

Ala Cys Gln Tyr Cys ***

Linker 2 



otide primers (Life Technologies, Inc.) were included in the first strand 
synthesis. A cDNA library consisting of 1.5 x 10^6 independent clones
was constructed in the yeast expression vector pYES 2.0 (Invitrogen) as
described (13), and screened by colony hybridization (14) using a
random-primed (15) 32P-labeled (>1 x 10^9 cpm/µg) mutanase cDNA fragment
as a probe. The hybridizations were carried out in 2 x SSC, 5 x
Denhardt's solution (14), 0.5% (w/v) SDS, 100 µg/ml denatured salmon
sperm DNA for 24 h at 65 °C followed by washes in 2 x SSC (2 x 15
min), 2 x SSC, 0.5% SDS (15 min), 0.2 x SSC, 0.5% SDS (15 min), and
finally in 2 x SSC (2 x 15 min) at 65 °C.

Cloning of P. purpurogenum Mutanase Gene — Total cellular DNA
was isolated from P. purpurogenum cells by a previously described
method (16), and used for construction of genomic DNA libraries in the
bacteriophage λ-ZipLox cloning system (Life Technologies Inc., Gaithersburg,
MD) (17). Approximately 45,000 plaques from the library were
screened by plaque hybridization (18) with a radiolabeled T. harzianum
mutanase probe fragment using moderate stringency conditions (5 x
SSPE, 35% formamide (v/v), 0.3% SDS, 200 µg/ml denatured and
sheared salmon testes DNA; hybridization temperature 45 °C. Membranes
were washed once in 0.2 x SSPE with 0.1% SDS at 45 °C
followed by two washes in 0.2 x SSPE (no SDS) at the same temperature).
Plaques which gave hybridization signals were purified twice on
Escherichia coli Y1090ZL cells, and the mutanase clones were subsequently
excised from the λ-ZipLox vector as pZL1 derivatives (19). One
such clone, designated pZL-Pp6A, was selected for further study.

DNA Sequence Analysis — DNA sequencing was done with an Applied 
Biosystems Model 373A Automated DNA Sequencer (Applied Biosys- 
tems, Inc., Foster City, CA) using a combination of shotgun DNA 
sequencing (20) and the primer walking technique with dye-terminator 
chemistry (21). 

Construction of T. harzianum Mutanase Expression Vector — The T.
harzianum mutanase cDNA fragment was inserted in a two-step cloning
procedure into an A. oryzae expression vector, pMHan37 (kindly
provided by I. G. Clausen, Novo Nordisk A/S), which contains the A.
nidulans amdS gene as a selectable marker, pUC plasmid sequences for
replication in E. coli, an α-amylase gene promoter from A. oryzae, and
the A. niger glucoamylase (glaA) terminator (22). In the first step,
pMHan37 was linearized with the restriction enzymes EcoRI and XhoI.
This fragment was ligated with the following three segments: 1) a
618-nt fragment of the α-amylase promoter sequence bordered by an
EcoRI site at the 5' end and a BamHI site at the 3' end; 2) linker
number 1 listed below, which has a BamHI site at the 5' end and a NarI
site at the 3' end and includes the Met start codon and 12 amino
acids of the mutanase signal sequence; and 3) a 68-nt NarI/XhoI frag-



Expression of Recombinant T. harzianum Mutanase in A. oryzae —
The A. oryzae host strain JaL125 was transformed using a polyethylene
glycol-mediated protocol (24) and a DNA mixture containing 0.5 µg of a
plasmid encoding the gene that confers resistance to the herbicide
Basta (25) and 8.0 µg of the expression vector pMT1796. Transformants
were selected on minimal plates containing 0.5% Basta and 50 mM urea
as a nitrogen source. Each transformant was purified twice on selection
media and conidia were harvested. Universal containers (20 ml, Nunc,
catalog number 364211) containing 10 ml of YPM (2% maltose, 1%
bactopeptone, and 0.5% yeast extract) were inoculated
with spores from the transformants and incubated for 5 days with shaking
at 30 °C. Culture supernatants were harvested after 5 days of growth and
assayed for the recombinant mutanase.

Expression of P. purpurogenum Mutanase in A. oryzae — Two synthetic
oligonucleotide primers were designed to amplify the P. purpurogenum
mutanase gene from plasmid pZL-Pp6A, 5'-cccatttaaatATGAAAGTCTCCAGTGCCTTC
and 5'-cccttaattaaTTAGCTCTCTACTTGACAAGC
(capital letters correspond to the sequence present in the
mutanase coding region). One hundred picomoles of each primer was
used in a PCR reaction containing 52 ng of plasmid DNA, 1x Pwo
polymerase buffer (Roche Molecular Biochemicals, Indianapolis, IN), 1
mM each dATP, dTTP, dGTP, dCTP, and 2.5 units of Pwo polymerase
(Roche Molecular Biochemicals). The PCR conditions were 95 °C 3 min,
25x (95 °C 1 min, 60 °C 1 min, 72 °C 1.5 min), 72 °C 5 min. The
amplified 2.2-kilobase DNA fragment was purified by gel electrophoresis
and cut with restriction endonucleases SwaI and PacI (using conditions
specified by the manufacturers). The fragment was cloned into
plasmid pBANe6 (26) that had been previously cut with SwaI and PacI
and the resultant expression plasmid was named pJeRS35. This vector
was introduced into A. oryzae host strain JaL142 using a standard
protoplast transformation procedure (24) and 40 transformants were
selected by their ability to grow on COVE medium using acetamide as
sole nitrogen source. The transformants were grown in 20 ml of MY50N
media (MY50N in g/liter: Nutriose (Roquette), 62; MgSO4·7H2O, 2.0;
KH2PO4, 2.0; citric acid, 4.0; yeast extract, 8.0; urea, 2.0; trace metals,
0.5 ml; pH 6.0, and then add CaCl2, 0.1) in shaker flasks for 3 days at
34 °C with agitation. Mutan assay plates were prepared by blending a
suspension of 1% (v/w) mutan, 1% agarose in 0.1 M sodium acetate
buffer, pH 5.5, for 20 min at 4 °C. The agarose was melted by heating
and 150-mm Petri plates were poured. After solidification, small wells
(about 40 µl equivalent volume) were punched in the plates. To screen
the transformants for ability to secrete mutanase, 35 µl of centrifuged
culture broth from each transformant (and one untransformed control)
were pipetted into the wells and the plates were incubated at 37 °C.
Mutanase activity in the broth samples caused formation of clearing
zones around the wells.
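
The primer design described in this paragraph (a lowercase 5' tail carrying a cloning site, followed by uppercase gene-specific sequence) can be illustrated with a short sketch. It locates the SwaI (ATTTAAAT) and PacI (TTAATTAA) recognition sequences in the two primers and separates the tail from the gene-specific part. The helper names are hypothetical, and the primer strings are those quoted above with the line-wrap hyphens and stray spaces of this copy removed.

```python
# Recognition sequences of the two enzymes named in the text.
SWAI_SITE = "ATTTAAAT"
PACI_SITE = "TTAATTAA"

# Lowercase = added 5' tail, uppercase = mutanase coding-region sequence.
FORWARD_PRIMER = "cccatttaaatATGAAAGTCTCCAGTGCCTTC"
REVERSE_PRIMER = "cccttaattaaTTAGCTCTCTACTTGACAAGC"


def site_index(primer: str, site: str) -> int:
    """Return the 0-based index of a recognition site in a primer, or -1."""
    return primer.upper().find(site.upper())


def gene_specific_part(primer: str) -> str:
    """Return only the uppercase (gene-specific) bases of a primer."""
    return "".join(base for base in primer if base.isupper())


for name, primer, site in (
    ("forward / SwaI", FORWARD_PRIMER, SWAI_SITE),
    ("reverse / PacI", REVERSE_PRIMER, PACI_SITE),
):
    print(name, "site at index", site_index(primer, site),
          "| gene-specific:", gene_specific_part(primer))
```

The forward gene-specific part begins with the ATG start codon, and the reverse part begins with TTA, the reverse complement of a TAA stop codon, which is consistent with amplification of the complete coding region.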

Preparation of Mutan — Mutan was prepared by growing S. mutans
CBS 350.71 at 37 °C, pH 6.5 (kept constant), at a stirring rate of 75 rpm
in a medium comprised of the following components: NZ-Case, 6.5
g/liter; yeast extract, 6 g/liter; (NH4)2SO4, 20 g/liter; KgP0 4 , 3 g/liter;
glucose, 50 g/liter; pluronic PE6100, 0.1%. After 35 h, sucrose was
added to a final concentration of 60 g/liter to induce glucosyltransferase.
The total fermentation time was 75 h. The supernatant from this
fermentation was centrifuged and filtered (sterile). Sucrose was added
to the supernatant to a final concentration of 5% (pH was adjusted to
pH 7.0 with acetic acid) and the solution was stirred overnight at 37 °C.
The solution was filtered and the insoluble mutan harvested on a
Propex 23 filter (Scapa Filtration) and washed with deionized water
containing 1% sodium benzoate, pH 5 (adjusted with acetic acid). Finally,
the insoluble mutan was lyophilized and ground.

Enzyme Assays — The production of soluble reducing sugars released
from mutan was employed as a measure of enzyme activity. First, 0.1
ml of 5% mutan in 50 mM sodium acetate, pH 5.5 (allowed to swell for at
least 1 h), was added to 0.3 ml of enzyme sample (diluted in water) in
a round-bottomed Eppendorf vial to ensure sufficient agitation and
incubated for 15 min at 40 °C while shaking vigorously. The reaction
was terminated by adding 0.1 ml of 0.4 M NaOH and the samples were
centrifuged for 5 min at 14,000 x g and filtered through 0.45-µm
HV-filters (Millipore). To each filtrate (100 µl) in Eppendorf vials, 750 µl
of ferricyanide reagent (0.4 g/liter K3Fe(CN)6, 20 g/liter Na2CO3) was
added and incubated for 15 min at 85 °C. After allowing the samples to
cool, the decrease in absorbance at 420 nm was measured. A dilution
series of glucose was included as a standard. Proper controls (substrate
and enzyme blanks) were always included. One mutanase unit (MU)
was defined as the amount of enzyme releasing 1 µmol of reducing
sugar per minute at pH 5.5 and 40 °C. Temperature profiles were
obtained by incubating the assay mixture (50 mM sodium acetate, pH
5.5) at various temperatures. The pH profiles were obtained by suspending
the mutan in 50 mM buffer at various pH values (glycine-HCl, pH
3-3.5; sodium acetate, pH 4-5.5; and sodium phosphate, pH 6-7.5).

Purification of Recombinant T. harzianum Mutanase — The fermentation
broth (700 ml) containing 15.4 MU/ml was filtered using GF/A
(Whatman) and HV 0.45-µm (Millipore) filters and concentrated on a
Filtron concentrator equipped with a 10-kDa cut-off membrane. The pH
was adjusted to 4.7 (conductivity approximately 300 microsiemens/cm),
and the broth was loaded onto an S-Sepharose column (XK 50/22,
Amersham Pharmacia Biotech) equilibrated in 10 mM sodium acetate,
pH 4.7. The mutanase was eluted in a linear NaCl gradient. Fractions
containing mutanase activity were pooled and concentrated on an Amicon
Cell (YM10) and loaded onto a HiLoad Q-Sepharose column (Amersham
Pharmacia Biotech) equilibrated in 10 mM Tris-HCl, pH 8.0
(approximately 600 µS/cm), in three rounds. The mutanase was eluted in
a linear gradient of NaCl. Pooled fractions (according to activity/purity)
were concentrated and further purified by gel filtration on a Superdex 75
(16/60) column (Amersham Pharmacia Biotech) in 0.1 M sodium acetate,
pH 6.0.

Purification of Recombinant P. purpurogenum Mutanase — The fermentation
broth (780 ml) containing 2.2 MU/ml was filtered (0.45 µm;
HV Millipore) and mixed with 15.6 g of mutan, washed in 0.1 M sodium
acetate, pH 5.5, to provide a 2% solution. The pH was adjusted to 5.5
and the suspension was allowed to stand at 4 °C for 1 h while stirring.
The suspension was then filtered on a sintered glass filter funnel and
the mutan was washed four times with 0.1 M sodium acetate, pH 5.5
(total volume, 1110 ml), and then six times with Milli Q-filtered deionized
water (total volume, 1250 ml); after each washing step the suspension
was filtered. The mutanase eluted during the washing with water.
These filtrates were pooled, filtered (0.7 µm, Whatman), concentrated
on a Filtron concentrator equipped with a 10-kDa cut-off membrane,
and further concentrated to 25 ml on an Amicon cell (YM10 membrane).

Preparation of Binding Isotherms — Equilibrium binding was ascertained
with 10 mg/ml mutan incubated at 4 °C with 0.5 MU mutanase
in 10 mM Britton-Robinson buffer, pH 7. At various time points, samples
were taken and filtered (0.45-µm HV, Millipore) prior to measuring
the activity. Binding isotherms were obtained by incubating various
concentrations of purified mutanase in a 0.2% suspension of mutan in
0.1 M sodium phosphate, pH 7, for 1 h at 4 °C while stirring. The mutan
was rinsed in buffer prior to use. Samples were then centrifuged for 10
min at 15,000 x g and the amount of enzyme left in the supernatant
determined by fluorescence spectrometry (Perkin-Elmer LS50) with
excitation at 280 nm and emission at 345 nm. A fluorescence standard
curve of the enzyme diluted in buffer was always included. Alternatively,
the activity was measured in the supernatant and compared
with the control. The data were fitted using the simple Langmuir theory



for adsorption to a surface: $A = (A_{max} \times E_{free})/(K_D + E_{free})$, where $A$ is
the adsorbed protein, $A_{max}$ is the maximum amount of protein which
can be adsorbed to the surface, $E_{free}$ is the free protein, and $K_D$ is the
equilibrium constant for $ES \rightleftharpoons E + S$ (27).
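
As an illustration of how a binding isotherm of this form can be evaluated and fitted, the following is a minimal sketch. The text does not state which fitting procedure was used; the sketch substitutes a simple linearization, E/A = E/A_max + K_D/A_max, solved by ordinary least squares. The function names, the parameter values, and the data points are hypothetical and noise-free.

```python
def langmuir_bound(e_free: float, a_max: float, k_d: float) -> float:
    """Adsorbed protein A = A_max * E_free / (K_D + E_free)."""
    return a_max * e_free / (k_d + e_free)


def fit_langmuir(e_free_values, bound_values):
    """Estimate (A_max, K_D) from the linearized form E/A = E/A_max + K_D/A_max."""
    xs = list(e_free_values)
    ys = [e / a for e, a in zip(xs, bound_values)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals 1 / A_max
    intercept = y_mean - slope * x_mean    # equals K_D / A_max
    return 1.0 / slope, intercept / slope


# Hypothetical parameters and free-enzyme concentrations (arbitrary units).
TRUE_A_MAX, TRUE_K_D = 1.0, 0.12
free = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
bound = [langmuir_bound(e, TRUE_A_MAX, TRUE_K_D) for e in free]

fitted_a_max, fitted_k_d = fit_langmuir(free, bound)
print(f"A_max = {fitted_a_max:.3f}, K_D = {fitted_k_d:.3f}")
```

With real, noisy measurements a direct nonlinear least-squares fit of the untransformed equation is usually preferable, because linearization distorts the error structure.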

SDS-PAGE— SDS-PAGE was done with 4-20 or 8-16% gradient 
gels (Novex) according to the manufacturer's instructions.

Differential Scanning Calorimetry — Samples for DSC were desalted
into the appropriate buffer using NAP-5 columns from Amersham
Pharmacia Biotech. Final enzyme concentrations were in the range
from 2 to 3 mg/ml. Samples were scanned from 20 to 90 °C using a scan
rate of 90 °C/h on the MC-2 (MicroCal).

Protein Sequencing — NH2-terminal amino acid sequencing was done
using an Applied Biosystems 473A protein sequencer according to the
manufacturer's instructions.

Isolation and Mutan Binding Activity of the COOH-terminal Domain
from the T. harzianum Mutanase — Mutanase was incubated for 2.5 h at
30 °C with chymotrypsin (Roche Molecular Biochemicals) in a ratio of
100:1 (mutanase:chymotrypsin, w/w) in 50 mM NH4HCO3. The digest
was investigated on SDS-PAGE and the 41-kDa band observed was
electroblotted from SDS-PAGE onto a Millipore Immobilon-P SQ polyvinylidene
difluoride membrane in 10 mM CAPS, 6% methanol at 175 mA
for 3 h and subjected to NH2-terminal amino acid sequencing, revealing
a sequence of SLTIGL- corresponding to proteolytic cleavage after
amino acid residue Phe-473. A 50-µl sample of the digest was incubated
for 30 min at room temperature with 50 µl of 2.5% mutan suspension.
The sample was centrifuged for 2 min at 15,000 x g. A 30-µl volume of
supernatant was then analyzed by SDS-PAGE (Novex 4-20%). Controls
without mutan were included.

RESULTS 

Wild-type T. harzianum Mutanase — Purified wild-type T.
harzianum mutanase displayed a molecular mass of 75 kDa
both in SDS-PAGE and MALDI-MS. Carbohydrate composition
analysis revealed only glucose and mannose but no N-acetylglucosamine,
indicating O-glycosylation. The amount of glucose
and mannose (18 and 32 mol/mol enzyme, respectively)
accounts for over 8 kDa which, added to the theoretical mass
(63.8 kDa), gives a molecular mass of about 72 kDa in close
agreement with the 75 kDa measured by MALDI-MS and
SDS-PAGE.
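
The mass estimate in this paragraph can be reproduced with a short worked calculation. The sketch below assumes that each glucose or mannose unit contributes the mass of an anhydrohexose residue (C6H10O5, about 162.14 Da) once it is glycosidically linked; the variable names are illustrative.

```python
# Mass added per glycosidically linked hexose (hexose minus one water), in Da.
HEXOSE_RESIDUE_DA = 162.14

# Values reported in the text.
GLUCOSE_PER_ENZYME = 18
MANNOSE_PER_ENZYME = 32
THEORETICAL_PROTEIN_KDA = 63.8

hexoses = GLUCOSE_PER_ENZYME + MANNOSE_PER_ENZYME
glycan_kda = hexoses * HEXOSE_RESIDUE_DA / 1000.0
total_kda = THEORETICAL_PROTEIN_KDA + glycan_kda

print(f"{hexoses} hexose residues contribute about {glycan_kda:.1f} kDa")
print(f"Estimated glycoprotein mass: about {total_kda:.1f} kDa")
```

Under this assumption the 50 hexose residues add slightly more than 8 kDa, giving a total close to the 72 kDa quoted in the text.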

Isolation and Characterization of cDNA Clones Encoding the 
Mutanase from T. harzianum — To obtain a cDNA probe for the 
T. harzianum mutanase, two oligonucleotides based on a 
genomic mutanase clone from T. harzianum (10) were de- 
signed. These primers were used to amplify a mutanase cDNA 
fragment from T. harzianum first-strand cDNA employing the 
PCR technique (28). Sequencing of the subcloned PCR frag- 
ment revealed a 387-bp cDNA with an open reading frame of 
129 amino acids. In addition to the primer-encoded residues, 
the ORF was identical to the corresponding region in the T. 
harzianum mutanase amino acid sequence (10), confirming 
that the PCR had specifically amplified the desired cDNA spe- 
cies. Approximately 10,000 colonies from a T. harzianum cDNA 
library in E. coli were screened using the mutanase-specific 
PCR product as a probe. This yielded 12 positive clones with 
inserts ranging from 0.8 to 2.0 kilobases. These were further
analyzed by sequencing the ends of the cDNAs with forward 
and reverse pYES polylinker primers, and determining the 
nucleotide sequence of the longest cDNA from both strands 
with synthetic oligonucleotide primers. The nucleotide se- 
quence and the deduced amino acid sequence of the mutanase 
cDNA from T. harzianum are presented in Fig. 1. The 2062-bp
cDNA clone contains a 1905-bp open reading frame initiating 
with an ATG codon at nucleotide position 29 and terminating 
with a TAG stop codon at nucleotide position 1931, thus pre- 
dicting a 634-residue polypeptide. The open reading frame is 
preceded by a 28-bp 5'-noncoding region and followed by a 
119-bp 3'-noncoding region and a poly(A) tail. 
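
The coordinates quoted above are mutually consistent, which can be verified with a few lines of arithmetic. This is an illustrative sketch only: the helper name is hypothetical, positions are taken as 1-based, and the poly(A) length is simply what remains after the stated 3'-noncoding region rather than a value given in the text.

```python
def orf_layout(cdna_len, start, orf_len, utr3_len):
    """Derive ORF features from 1-based coordinates (start = first base of ATG)."""
    end = start + orf_len - 1              # last base of the stop codon
    return {
        "utr5_len": start - 1,             # bases preceding the ATG
        "stop_codon_start": end - 2,       # first base of the stop codon
        "residues": orf_len // 3 - 1,      # codons minus the stop codon
        "poly_a_len": cdna_len - end - utr3_len,
    }


# Values from the text: 2062-bp cDNA, ATG at position 29, 1905-bp ORF,
# 119-bp 3'-noncoding region.
layout = orf_layout(cdna_len=2062, start=29, orf_len=1905, utr3_len=119)
print(layout)
```

With these inputs the sketch reproduces the 28-bp 5'-noncoding region, the stop codon at position 1931, and the 634-residue polypeptide reported above.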

Cloning of P. purpurogenum Mutanase — Southern blotting 
experiments indicated that the T. harzianum mutanase cDNA
could be used as a probe to identify mutanase gene-specific

Fig. 1. The nucleotide sequence and the deduced amino acid sequence
of the α-1,3-glucanase (mutanase) cDNA from T. harzianum. The signal
peptide and propeptide region are underlined, and the NH2-terminal
residues determined from the purified, recombinant T. harzianum
mutanase are indicated by double underlines. The putative linker
region (rich in Ser, Pro, and Thr) flanked by Cys residues at
positions 484 and 553 is highlighted in gray. Noncoding sequences are
in lowercase letters. This sequence has been deposited in the
Geneseq™ data base with the accession number V12368.



[Sequence listing for Fig. 1: nucleotide positions 1-2062 of the cDNA
with the deduced amino acid sequence; see Geneseq accession V12368.]



fragments in P. purpurogenum genomic DNA (data not shown).
Consequently, a genomic library was constructed from P. purpurogenum
cellular DNA using the bacteriophage vector λ-ZipLox. Approximately
45,000 plaques from this library were screened by hybridization using
a segment of the T. harzianum mutanase cDNA as the probe. Eighteen
positive clones which hybridized strongly to the probe were picked
and 10 were plaque-purified (18) and excised from the λ cloning vector
using the in vivo excision protocol (19). Preliminary restriction
mapping on one of the pZL1-mutanase clones (designated pZL-

Fig. 2. DNA sequence and deduced amino acid sequence of
P. purpurogenum mutanase gene. The signal peptide and propeptide
region are underlined, and the NH2-terminal residues determined from
the purified, recombinant P. purpurogenum mutanase are indicated by
double underlines. The putative linker region (rich in Ser, Pro, and
Thr residues) flanked by Cys residues at positions 477 and 547 is
highlighted in gray. Introns and noncoding regions are indicated by
lowercase letters. Consensus lariat sequences (PuCTPuAC) within each
intron are denoted by a dashed underline. This sequence has been
deposited in the Geneseq™ data base with the accession number T89024.



[Sequence listing for Fig. 2: nucleotide positions 1-2523 of the
genomic clone with the deduced amino acid sequence; see Geneseq
accession T89024.]

Fig. 3. a, sequence alignment of the catalytic domains of
T. harzianum (Triha) and P. purpurogenum (Penpu) mutanases (Geneseq™
protein data base accession numbers W44193 and W32213, respectively)
with the two homologous S. pombe ORFs of unknown function (C14C4.09
and BC646.06c; GenBank Z98596 and AL035216, respectively). Residues
identical in 3 of the 4 sequences are printed in white on black
background. b, alignment of the mutan-binding domains of T. harzianum
(Triha) and P. purpurogenum (Penpu) mutanases. Identical residues are
printed in white on black background. The two regions of internal
similarity are boxed.

Pp6A) revealed that the region which hybridized to the T. harzianum
mutanase cDNA was localized near one end of a 3.6-kilobase genomic DNA
insert (not shown). DNA sequencing of a portion of this segment showed
an open reading frame with clear homology to the T. harzianum mutanase
cDNA and its deduced amino acid sequence (Fig. 2). The positions of
introns and exons within the P. purpurogenum mutanase gene were
assigned based on alignments of the deduced amino acid sequences to
the corresponding T. harzianum mutanase gene product. On the basis of
this comparison, the P. purpurogenum mutanase gene is composed of five
exons (126, 532, 226, 461, and 548 bp) which are punctuated by four
small introns (63, 81, 58, and 78 bp). These appear to be typical
fungal introns with respect to size and composition in that all
contain consensus splice donor and acceptor sequences as well as the
consensus lariat sequence (PuCTPuAC) near the 3' end of each
intervening sequence (29).
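
The exon and intron sizes listed above determine the layout of the gene. The sketch below is illustrative only; it assumes that the stated exon lengths run from the ATG through the stop codon, and it numbers positions from the A of the ATG, a convention chosen here and not used in the text.

```python
# Exon and intron sizes (bp) as given in the text.
exons = [126, 532, 226, 461, 548]
introns = [63, 81, 58, 78]

coding_bp = sum(exons)               # spliced coding sequence, stop codon included
gene_bp = coding_bp + sum(introns)   # ATG through stop codon in genomic DNA
residues = coding_bp // 3 - 1        # codons minus the stop codon

# 1-based exon coordinates relative to the A of the ATG (assumed convention).
segments = []
pos = 1
for i, exon in enumerate(exons):
    segments.append((pos, pos + exon - 1))
    pos += exon + (introns[i] if i < len(introns) else 0)

print(coding_bp, gene_bp, residues)
for n, (first, last) in enumerate(segments, start=1):
    print(f"exon {n}: {first}-{last}")
```

Under these assumptions the coding sequence is a whole number of codons, which is a useful sanity check on intron placement inferred from an alignment.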

Comparison of Trichoderma and Penicillium Mutanase Pri- 
mary Structures — The signal peptide and propeptide portions 
of P. purpurogenum mutanase and T. harzianum mutanase 
share little amino acid sequence similarity; however, the mature
polypeptides (after removal of signal and propeptides) are
approximately 53% identical. The regions of greatest identity are
located in the NH2-terminal portion (residues 42 through 491;
T. harzianum numbering) and over approximately the last 70 residues of
these two proteins, where 60 and 63% identity is observed,
respectively. In both mutanases the NH2-terminal and COOH-terminal
domains are separated by a Pro-Ser-Thr-rich linker region. Remarkably,
the Pro-Ser-Thr-rich region of P. purpurogenum mutanase (residues 475
through 547) is composed of 69% Pro, Ser, and Thr, and is bordered
roughly by Cys residues at positions 477 and 547. As the two mutanases
appeared to have a modular structure, sequence comparisons using the
BLAST algorithm to search the non-redundant GenBank CDS translations
on the NCBI server (30) were therefore conducted on each domain
separately. BLAST searches using the NH2-terminal domains did not
produce any hits with known glycosidases; however, two ORFs of unknown
function in Schizosaccharomyces pombe (C14C4.09 and BC646.06c; GenBank
Z98596 and AL035216, respectively) were picked with highly significant
scores (E values ranging from 6 × 10⁻⁵⁰ to 10⁻²⁶), suggesting that
these ORFs encode similar glycosidases.

Fig. 4. Schematic maps of the vectors pMT1796 and pJeRS35 for the
heterologous expression of T. harzianum and P. purpurogenum mutanases
in A. oryzae (not to scale).

An alignment of the sequences of the catalytic domains of the two
mutanases with the two ORFs of S. pombe is shown in Fig. 3a. BLAST
searches conducted with the COOH-terminal domains of the mutanases
also failed to produce any significant hit in GenBank. Within the
COOH-terminal domains of the two mutanases, two short regions display
intriguing similarity (10 residues conserved out of 15), suggesting
the existence of an internal duplication (Fig. 3b).
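
Percent identity figures such as those quoted above depend on how aligned positions are counted. The following is a minimal sketch of one common convention (identical positions divided by aligned, ungapped columns); it is not necessarily the convention used by the authors, and the example sequences are hypothetical toy strings, not mutanase fragments.

```python
def percent_identity(a, b):
    """Percent identical residues over aligned columns with no gap in either row."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# Toy aligned fragments (hypothetical); '-' marks an alignment gap.
toy = percent_identity("ACDEFGHIKL", "ACDQFGH-KL")
print(round(toy, 1))

# The "10 residues conserved out of 15" quoted in the text, as a percentage.
print(round(100.0 * 10 / 15, 1))
```

Other conventions divide by the length of the shorter sequence or by the full alignment length including gaps, so identity values from different sources are only comparable when the denominator is known.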

Heterologous Expression of T. harzianum Mutanase in A. 
oryzae — The T. harzianum mutanase coding region was ampli- 
fied by PCR, and the amplicon was inserted into an Aspergillus 
expression vector so that the gene was under the control of an 
A. oryzae α-amylase gene promoter and an A. niger glaA terminator.
The resulting expression construct, pMTH1802, was further modified by
changing amino acid 35 from Glu to Lys, resulting in the presence of a
dibasic (KEX2-type) processing site at the amino terminus of the
mature mutanase protein. This new expression vector, pMT1796
(Fig. 4a), was used to transform an A. oryzae strain, and 25
independent transformants were isolated. Mycelia from each
transformant were used to inoculate 20-ml culture tubes containing
10 ml of YPM medium, and cultures were grown with shaking for 5 days
at 30 °C. SDS-PAGE analysis revealed a dominant 85-90-kDa band,
indicating that these transformants were indeed expressing the
recombinant mutanase gene.

Heterologous Expression of P. purpurogenum Mutanase —
The P. purpurogenum mutanase coding region was amplified by PCR using
primers that created 5'- and 3'-terminal restriction sites compatible
with an Aspergillus expression vector. The amplified DNA segment was
subsequently inserted into the vector, which employed a strong
A. oryzae α-amylase gene promoter. The resulting plasmid, designated
pJeRS35 (Fig. 4b), was used to transform an A. oryzae recipient
strain, and 40 transformants were isolated. Mycelia from each of the
transformants were used to inoculate shaker flask cultures that were
incubated for 3 days. Using a mutan agar plate assay, 14 of the
transformants showed extracellular mutanase activity as indicated by
opaque clearing zones (the control showed no clearing zone). Broth
samples that were positive in the plate assay were subsequently
analyzed by SDS-PAGE. These transformants showed a prominent band at
approximately 90 kDa.

Purification of and Molecular Properties of Recombinant Mu- 
tanases — Recombinant T. harzianum mutanase was purified in 
a three-step procedure using cation-exchange chromatography,
anion-exchange chromatography followed by size exclusion
chromatography, resulting in a yield of around 24%. The essentially
pure mutanase exhibited a molecular mass around 86 kDa (Fig. 5a). A
rather broad band was observed, indicating some heterogeneity and/or
heavy glycosylation. The NH2-terminal amino acid sequence was
determined by protein sequencing to be Ala-Ser-Ser-, thus predicting a
calculated molecular mass of 63.8 kDa for the mature enzyme (Table I).
This observation suggests that the first 37 amino acid residues
deduced from the gene sequence function as a secretory signal peptide
and propeptide. This is supported by the fact that the NH2 terminus of
the mature mutanase is not preceded by a typical signal peptidase
cleavage site (31; Fig. 1) but rather by a cleavage site for a
monobasic processing enzyme. Furthermore, the mutanase cDNA encodes an
apparent signal sequence of 24 amino acids, with a predicted signal
peptidase cleavage site between Ala-24 and Ala-25 in the mutanase
precursor (31). A simpler procedure for purification of the
recombinant P. purpurogenum mutanase was established using the
information that a putative COOH-terminal binding domain is present in
the enzyme. The enzyme was adsorbed to insoluble mutan and
subsequently eluted in water. This procedure resulted in a 129-fold
purification and a yield around 20%. The essentially pure mutanase had
a molecular mass of about 90 kDa (Fig. 5b). NH2-terminal amino acid
sequencing revealed the following sequence: Ser-Thr-Ser-Asp-Arg-.
Thus, the deduced amino acid sequence of the mutanase gene product
(Fig. 2) predicts an amino-terminal extension of 30 amino acids which
are not present in the mature enzyme and a molecular mass for the
mature enzyme of 63.6 kDa (Table I). Based on the rules of von Heijne
(31), the first 20 amino acids likely comprise a secretory signal
peptide, and the next 10 residues probably represent a propeptide
segment which is removed by a subsequent proteolytic cleavage
following the dibasic Arg-Arg sequence.

Fig. 5. SDS-PAGE 4-20% (Novex). a, low molecular mass standard
(lane 1); purified recombinant T. harzianum mutanase (lane 2).
b, purified recombinant P. purpurogenum mutanase (lane 1), low
molecular mass standard (lane 2).

Table I
Molecular properties of purified recombinant mutanases

Enzyme           NH2-terminal  Starting at  Molecular mass  Molecular mass  pI
                 sequence      residue no.  (calculated)    (SDS-PAGE)      (calculated)
T. harzianum     Ala-Ser-Ser-  38           63,785 Da       86 kDa          5.24
P. purpurogenum  Ser-Thr-Ser-  31           63,557 Da       90 kDa          3.88

Characterization of the Purified Recombinant Mutanases — 
The two mutanases showed similar catalytic properties. They 
both exhibit slightly acidic pH optima in the range from pH 3.5 
to 5.0 and pH 3.0 to 4.5 for the T. harzianum and P. purpuro- 
genum mutanases, respectively. At pH 5.5 the two enzymes 
have specific activities of 16 and 12 MU/mg, respectively, on 
insoluble mutan at 40 °C. Also, the two mutanases have virtu- 
ally identical temperature optima around 50-55 °C at pH 5.5. 
This correlates with DSC analysis of the thermal stability of 
the recombinant T. harzianum mutanase, which shows a midpoint
denaturation temperature (Tm) around 56 °C at pH 5.5, identical to
that of the wild-type enzyme.

The binding properties of the two mutanases toward insoluble mutan
were investigated under steady-state conditions at pH 7 and 4 °C in
order to limit hydrolysis. The kinetics of adsorption were followed by
taking samples from the supernatant of mutanase incubated with mutan.
Equilibrium was reached within 5 min, after which no further net
adsorption was observed (data not shown). Varying concentrations of
mutanase were incubated for 1 h under the above conditions with mutan;
the amount of free mutanase was determined by fluorescence
spectroscopy (concentrations verified by activity analysis), and the
amount of bound enzyme was calculated. Binding isotherms were thus
generated, and the data were fitted using the simple Langmuir model
for adsorption to a surface (Fig. 6). Rather strong binding was
observed, with desorption constants of 0.13 and 0.11 µM for the
T. harzianum and P. purpurogenum mutanases, respectively (Table II). A
significant difference is observed in the maximum level of enzyme
which can be adsorbed to the insoluble mutan: 0.549 versus 0.244 µmol
of enzyme/g of mutan for the T. harzianum and P. purpurogenum
mutanases, respectively (Table II).
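
For orientation, the simple Langmuir model relates bound enzyme A to free enzyme concentration [E] as A = Amax · [E] / (Kd + [E]). The sketch below evaluates this expression with the fitted constants reported in Table II; the function name is hypothetical, and treating the reported desorption constant as Kd in this form is an assumption about the authors' parameterization.

```python
def langmuir_bound(free_um, a_max, k_d):
    """Bound enzyme (µmol/g mutan) at a given free enzyme concentration (µM)."""
    return a_max * free_um / (k_d + free_um)


# Fitted constants from Table II: (Amax in µmol/g mutan, Kd in µM).
params = {
    "T. harzianum": (0.549, 0.129),
    "P. purpurogenum": (0.244, 0.111),
}

for name, (a_max, k_d) in params.items():
    bound = langmuir_bound(0.5, a_max, k_d)
    print(f"{name}: {bound:.3f} µmol/g mutan bound at 0.5 µM free enzyme")
```

In this form Kd is the free enzyme concentration at which half of the maximum binding capacity is occupied, which is why the two enzymes can have similar Kd values while differing about twofold in Amax.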

In order to probe the hypothesis that the homologous COOH-terminal
domain of the two fungal mutanases constitutes a mutan-binding domain,
the T. harzianum mutanase was subjected to limited proteolysis using
chymotrypsin. The protease treatment resulted in a 41-kDa band on
SDS-PAGE (Fig. 7), which was NH2-terminally sequenced after being
electroblotted onto a polyvinylidene difluoride membrane, revealing
the sequence Ser-Leu-Thr-Ile-Gly-Leu-, corresponding to proteolytic
cleavage between Phe-473 and Ser-474. This strongly suggests that the
41-kDa band corresponds to the linker and the COOH-terminal domain.
The chymotrypsin digest of the mutanase was then incubated with 2.5%
mutan before centrifuging the sample and loading the supernatant onto
SDS-PAGE. From the SDS-PAGE analysis (Fig. 7) it is apparent that the
41-kDa band has been adsorbed to the insoluble mutan since it is no
longer present in the supernatant.

Fig. 6. Substrate binding isotherms of purified recombinant
mutanases; 0.2% mutan in 0.1 M sodium phosphate, pH 7, 4 °C.
a, recombinant T. harzianum mutanase; b, recombinant P. purpurogenum
mutanase. (x-axis: free enzyme, µM.)

Table II
Substrate binding properties of purified recombinant mutanases

Enzyme           Kd (µM)         Amax (µmol/g mutan)
T. harzianum     0.129 ± 0.021   0.549 ± 0.047
P. purpurogenum  0.111 ± 0.016   0.244 ± 0.012

DISCUSSION 

Nucleic acid sequences encoding extracellular mutanases from the
filamentous fungi T. harzianum and P. purpurogenum were cloned and
successfully expressed in A. oryzae. The primary translation products
of these two DNA sequences appear to be preproenzymes, having both
NH2-terminal signal peptides and propeptides that are removed
post-translationally. The two mutanases show deduced amino acid
sequence identities of 53% overall. Further analyses of the protein
sequences reveal stronger similarity between the NH2-terminal and
COOH-terminal parts of the mature enzymes, separated by a less
homologous Pro, Ser, and Thr-rich region. Consequently, like many
cellulases, glucoamylases, and chitinases, the mature mutanases appear
to be made of two distinct domains: an NH2-terminal catalytic domain
and a putative COOH-terminal polysaccharide-binding domain, separated
by a Pro-Ser-Thr-rich linker peptide. MALDI-MS and carbohydrate
analysis of the wild-type enzyme from T. harzianum suggest that the
linker region is O-glycosylated in a manner similar to the
Ser-Thr-rich linker region of A. niger glucoamylase (32). The
glycosylation is even more pronounced in the recombinant enzymes,
which display molecular masses of 86 and 90 kDa for the T. harzianum
and P. purpurogenum mutanases, respectively. Fungal polysaccharidases
harboring a Pro-Ser-Thr-rich linker separating the catalytic from the
carbohydrate-binding domain have long been known to undergo
hyperglycosylation upon expression in yeast and other heterologous
fungal systems (33).

Fig. 7. SDS-PAGE 4-20% (Novex); 2.5% mutan + chymotrypsin digest of
T. harzianum mutanase (lane 1), low molecular mass standard (lane 2),
chymotrypsin digest of T. harzianum mutanase (lane 3).

Experiments showing that the chymotrypsin-produced fragment of the
T. harzianum mutanase adsorbs to mutan gave an indirect indication
that the COOH-terminal domain of the two fungal mutanases is
responsible for binding to insoluble mutan. As a first step in an
effort to further characterize the COOH-terminal mutan-binding domain
of T. harzianum mutanase, two expression plasmids were constructed
harboring (i) an internal deletion of the coding region encompassing
residues 32-542 (i.e. coding for the isolated COOH-terminal binding
domain without any linker) and (ii) the NH2-terminal catalytic domain
only. The transformants were tested by immunodiffusion using
antibodies raised against the whole mutanase. Whereas the isolated
catalytic domain was unaffected by preincubation with mutan, the first
transformant became negative upon preincubation with mutan (data to be
described elsewhere). The inability of the catalytic domain to bind to
mutan was verified by activity analysis showing that no activity could
be removed from the supernatant by preincubation with insoluble mutan.

Glycoside hydrolases and transglycosidases have been classified in a
number of distinct families based on amino acid sequence similarities
(34-36). BLAST searches (30) conducted with the NH2-terminal catalytic
domains of the two mutanases described here failed to display any
similarity with known glycosidases from previously defined families.
Because families of glycoside hydrolases are defined with at least two
related sequences (34), the two mutanases allow the definition of a
new family (designated family 71). Although sequence similarities
between the mutanase catalytic domains and the described ORFs of
S. pombe are such that it is predictable that all these proteins adopt
a similar fold and operate via the same catalytic mechanism using a
similar catalytic machinery (37-39), the precise substrate specificity
of the S. pombe proteins cannot be reliably ascertained, as it has
been shown that sequence-based families of glycosidases contain
enzymes with sometimes widely different substrate specificity (34).
Finally, it is worth mentioning that, unlike the two mutanases,
neither of the two S. pombe ORFs carries a COOH-terminal extension,
suggesting that the encoded proteins are made of a single domain.

Cellulases, xylanases, chitinases, and starch-degrading en- 
zymes have long been recognized to have a modular structure 
with a catalytic domain carrying one or several ancillary mod- 
ules whose function is often binding to insoluble polysacchar- 
ides (40). The best described of these ancillary modules are 
probably the cellulose-binding domains which have been clas- 
sified in several distinct families based on sequence similarities 
(41, 42). The lack of sequence similarity of the two COOH- 
terminal domains of the mutanases with any known carbohy- 
drate-binding domains together with their insoluble mutan 
binding activity allows the definition of a new family of carbo- 
hydrate-binding modules. 

The pH optimum observed for the two mutanases is not 
exactly in agreement with the earlier reported pH optimum 
around pH 6.0 for the T. harzianum mutanase (7) but compa- 
rable to the pH optimum obtained for the Trichoderma viride 
(43) and the Penicillium funiculosum mutanases (3). For comparison,
the bacterial mutanases from Bacillus circulans (44) and Streptomyces
chartreusis (45) have slightly higher pH optima than the two fungal
mutanases but similar temperature optima. Although the pH in the oral
cavity is around pH 6-7,
the slightly acidic pH profile of the two fungal mutanases may 
be of importance in the application for plaque removal as low 
pH values have been observed locally in the plaque (46). 

The substrate binding constants observed for the two muta- 
nases to insoluble mutan are, although slightly higher, in the 
range of reported binding constants for cellulase adsorption to 
insoluble cellulose (27). The difference in the maximum binding 
capacity observed for the two mutanases may be explained by 
differences in the batches of mutan used for the experiment as 
these have been found to vary somewhat in quality/purity. 
Alternatively, a possible explanation would be that the T. harzianum
mutanase is capable of dispersing the insoluble mutan (in analogy to
cellulose-binding domains and cellulose) to a larger extent than the
P. purpurogenum mutanase, thus revealing a larger surface area onto
which the enzyme can adsorb. The strong adsorption of the fungal
mutanases may be beneficial for their application in dentifrice, as
the enzymes are expected to bind to dental plaque and thus be retained
in the oral cavity, where they are supposed to act in removing the
dental plaque.

Abstract The hemA gene encoding 5-aminolevulinate synthase, the first
enzyme in heme biosynthesis, was cloned from Aspergillus oryzae and
evaluated as a selectable marker for the transformation of filamentous
fungi. Deletion of the hemA gene resulted in a lethal phenotype that
could be rescued either by the supplementation of culture media with
5-aminolevulinic acid (ALA) or by transformation with the wild-type
hemA gene, but not by growth on rich media, nor by the addition of
exogenous heme. Transformation of a hemA deletion strain with the hemA
gene linked to a lipase expression cassette yielded ALA prototrophs
expressing lipase. The hemA gene can therefore be used as a selectable
marker for the transformation of A. oryzae.

Key words hemA • Heterologous gene expression •
Heme biosynthesis • Recombinant enzyme



Introduction 

Filamentous fungi are widely used as efficient hosts for protein
production, secreting some proteins at up to 30 g/l (Ward 1989).
Harnessing these hosts for heterologous protein synthesis requires the
introduction of a transgene, together with a marker gene to allow
selection. Selectable markers can be grouped into three broad classes
based on provision of an ability to: (1) grow in the presence of an
antimicrobial substance, (2) synthesize a required nutrient, or
(3) utilize a unique nitrogen or carbon compound for growth (for
reviews on transformation and selection systems in fungi, see Esser
and Mohr 1986; Cullen and Berka 1987; Timberlake and Marshall 1989;
Gwynne and Devchand 1992). While these systems function effectively
for the initial selection of transformed DNA, they require the
addition of an antibiotic or the use of a defined growth medium for
continuous selection during large-scale fermentation. Consequently, an
alternate selection system functioning in rich media without costly
additives was sought. Since commercially useful filamentous fungi,
such as Aspergillus and Fusarium, are obligate aerobes, enzymes
involved in respiration were chosen as targets for the development of
such a system.

Heme, a cofactor required for functional respiratory 
cytochromes, is synthesized in an eight-step pathway 
that begins with the condensation of glycine and suc- 
cinyl-CoA to form 5-aminolevulinate (ALA). The gene 
encoding 5-aminolevulinate synthase (ALAS), the en- 
zyme catalyzing this reaction, has been cloned from 
a number of organisms including humans (Borthwick 
et al. 1984; Bawden et al. 1987), A. nidulans (Bradshaw 
et al. 1993) and Saccharomyces cerevisiae (Arrese et al. 
1983; Urban-Grimal et al. 1986). Deletion of the HEM1 gene encoding
ALAS in S. cerevisiae is lethal, but cells can be rescued by the
addition of exogenous ALA or heme, or by reintroduction of the
wild-type gene (Urban-Grimal and Labbe-Bois 1981; Arrese et al. 1983;
Bard and Ingolia 1984; Keng et al. 1986; Volland and Urban-Grimal
1988). HEM1 has previously been used as a selectable marker in
large-scale yeast fermentations, where oxygen limitation helped to
maintain plasmid selection (Bard and Ingolia 1990).

Here we describe the cloning of the gene encoding the first enzymatic
step in heme biosynthesis and its evaluation as a selectable marker in
A. oryzae.

Materials and methods

Isolation of the A. oryzae hemA gene 

The A. oryzae hemA gene was cloned by screening an A. oryzae
genomic library with a probe generated by PCR amplification of
the A. nidulans hemA gene. The genomic library was constructed
by cloning Tsp509I-cut DNA fragments (4-kb to 7-kb) into the
λ-ZipLox cloning vector (Life Technologies, Bethesda, Md.). The
probe was prepared by PCR amplification of A. nidulans hemA
DNA (A. nidulans strain FGSC26) under conditions recommended
in the DIG high prime DNA labeling and detection starter kit II
(Boehringer Mannheim, Indianapolis, Ind.) using oligonucleotides
ALAS3d 5'-TTTATGATGGAGGCCCTTCTCCAGCAGTCTC-3'
and ALAS4e 5'-CTATGCATTTAAGCAGCAGCCGCGACTGG-3'.
Bacteriophage DNA from 7 × 10^4 plaques was transferred
to duplicate circular nylon membranes (Nytran Plus, Schleicher &
Schuell, Keene, N.H.) and hybridized with a digoxigenin
(DIG)-labeled A. nidulans hemA probe. Membranes were hybridized at
42 °C in 5 × SSC, 0.1% Sarkosyl, 0.02% SDS, 1% Genius
blocking agent (Boehringer Mannheim, Indianapolis, Ind.) and 30%
formamide. Membranes were washed at room temperature twice in
5 × standard saline citrate (SSC) with 0.1% sodium dodecyl sulfate
(SDS), followed by two washes in 2 × SSC, 0.1% SDS. Five clones
were identified and excised into pZL derivatives according to the
manufacturer's instructions (Life Technologies, Bethesda, Md.).
These clones were found to overlap and span a 4.2-kb region
containing the hemA gene (GenBank accession number AF152374).
Sequencing was performed using a Perkin Elmer 377 automated
sequencer using Big Dye chemistry (Perkin Elmer, Foster City,
Calif.).
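
As a quick sanity check on the two probe primers given above, the sketch below reports their length, GC fraction and reverse complement. The sequences are copied from the text; the helper names (gc_fraction, reverse_complement, summarize) are illustrative and not part of the original methods.

```python
# Probe primers as printed in the text (5'->3').
PRIMERS = {
    "ALAS3d": "TTTATGATGGAGGCCCTTCTCCAGCAGTCTC",
    "ALAS4e": "CTATGCATTTAAGCAGCAGCCGCGACTGG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def summarize(primers: dict) -> dict:
    """Length, GC fraction and reverse complement for each named primer."""
    return {
        name: {
            "length": len(seq),
            "gc": gc_fraction(seq),
            "revcomp": reverse_complement(seq),
        }
        for name, seq in primers.items()
    }


if __name__ == "__main__":
    for name, info in summarize(PRIMERS).items():
        print(f"{name}: {info['length']} nt, GC {info['gc']:.0%}, "
              f"reverse complement {info['revcomp']}")
```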



Creation of hemA deletion plasmid pSE52 

Plasmid pSE17, a pZL derivative containing 3175 bp of hemA
genomic insert [523 bp of upstream sequence, the hemA open reading
frame (ORF) of 1911 bp and 722 bp of downstream sequence
(Fig. 1)] was the starting point for the construction of a hemA
deletion plasmid. A 4.1-kb A. oryzae pyrG fragment (including the pyrG
ORF) was inserted into the EcoRI site of hemA in pSE17, positioning
pyrG in the 3' flanking region of hemA. The resulting plasmid was
restricted with Eco47III and MluNI and re-ligated to create the
hemAΔ::pyrG allele. The deletion allele was then PCR-amplified using
the Expand PCR kit (Boehringer Mannheim, Indianapolis, Ind.)
with primers SE48up 5'-AATGGTCAAAACTGGCTCCTAC-3'
and SE48dwn 5'-TGTACCTGTTCTTGGGCTGTC-3' and
subcloned into pCR2.1 (Invitrogen, Carlsbad, Calif.) to create pSE52
(Fig. 2). A linear 6.3-kb fragment containing the hemAΔ::pyrG allele
could then be easily generated by restriction with SacI and NotI.



Creation of hemA-lipase expression plasmid pSE54

Plasmid pSE54 was constructed by PCR amplification of hemA
with primers SMupl 5'-GC TCTAGA TACCTGTTCTTGGGCTGTGAC-3'
and SM + term 5'-GC TCTAGA TGGCCCCTTCTATTGTTATTA-3'
from pSE17. The amplified product had
XbaI sites at either end (TCTAGA, set off by spaces in the primer
sequences), the hemA gene, and 491 bp of the native promoter. This
XbaI fragment was inserted into pBANe8, a fungal expression plasmid
containing the Thermomyces lanuginosus lipase gene (Boel and
Huge-Jensen 1989) under the control of the Taka-amylase A gene
promoter (Tada et al. 1991) to create plasmid pSE54 (Fig. 2).



Media and reagents 

Minimal medium contained (per liter): 6 g NaNO3, 0.52 g KCl,
1.52 g KH2PO4, 1 ml Cove trace elements (0.04 g Na2B4O7·10H2O,
0.4 g CuSO4·5H2O, 1.2 g FeSO4·7H2O, 0.7 g MnSO4, 0.8 g
Na2MoO2·2H2O and 10 g ZnSO4·7H2O per liter), 25 g Noble agar,
1 M sucrose, 20 ml 50% glucose and 2.5 ml 20% MgSO4, adjusted
to pH 6.5. YEG contained (per liter): 5 g yeast extract, 20 g glucose
and 1 M sucrose. MY25 medium contained (per liter): 1 × MY
salts [2 g MgSO4·7H2O, 2 g K2PO4, 10 g KH2PO4, 2 g citric acid,
0.5 ml AMG trace elements (13.9 g FeSO4·7H2O, 8.5 g MnSO4·H2O,
14.28 g ZnSO4·7H2O, 1.63 g CuSO4, 0.24 g NiCl2·6H2O
and 3.0 g citric acid per liter) and 1 ml 10% CaCl2·2H2O], 1%
yeast extract, 2.5% maltose and 0.2% urea, at pH 6.5.
Quarter-strength MY25 contained 1 × MY salts and 0.25% of the yeast
extract, maltose and urea concentrations from the above.

Fig. 1 Restriction map of the Aspergillus oryzae hemA gene and
hemAΔ::pyrG deletion allele, showing the hemA genomic region in
plasmid pSE17. The arrow indicates the putative hemA open reading
frame (hemA ORF)

Fig. 2 Plasmid maps of the hemAΔ::pyrG deletion plasmid, pSE52,
and the hemA-lipase expression plasmid. Positions of relevant genes
are indicated, ORFs are shown by arrows and the closed triangle
indicates the location of the hemA deletion in pSE52. Nucleotide
positions of relevant restriction sites are shown in parentheses

Hemin (Sigma Chemical Co., St. Louis, Mo.) and ALA (Por- 
phyrin Products, Logan, Utah) were prepared immediately prior to 
use as stock solutions in 50 mM NaOH and water, respectively. 



Fungal strains and transformations 

Fungal transformations were performed as previously described 
(Christiansen et al. 1988) with the following exceptions: Novozyme 
234 (Novo Nordisk, Bagsvaerd, Denmark) was used in a final 
concentration of 10 mg/ml during protoplasting without BSA, and 
1 ml of SPTC [0.8 M sorbitol, 40% polyethylene glycol 4000 
(BDH), 50 mM Tris-HCl and 50 mM CaCl 2 , pH 8.0] was used 
instead of STC. Transformations of hemA deletion strain SE29-70 
were performed using 10 µg of the indicated circular plasmid DNA
(pSE17 or pSE54), and the protoplasts were plated onto minimal 
medium for growth at 34 °C. 

The hemA deletion strain was created by transformation of
PyrG⁻ A. oryzae strain HowB425 (Brody et al. 1998) with
a hemAΔ::pyrG deletion allele. The linear 6.3-kb SacI-NotI
hemAΔ::pyrG fragment from plasmid pSE52 was isolated by
agarose gel electrophoresis and used in the transformation
reaction. Transformants were selected on minimal medium containing
either 2.5 mM ALA or 5 mM ALA. ALA auxotrophy was 
determined by assessing growth of primary transformants on 
minimal medium lacking ALA at 34 °C. Strains showing marginal 
growth under these conditions were streaked for single colony 
isolation on minimal medium supplemented with ALA. Several 
strains were confirmed as hemA deletants by Southern analysis 
and one strain, SE29-70, was used for all subsequent transfor- 
mations. 



Southern hybridization analysis 
Genomic DNA isolations 

Genomic DNA was isolated as previously described (Wahleithner 
et al. 1996). For all Southern blots, 10 µg of restricted DNA was
electrophoresed and transferred either to Nytran Plus (Schleicher &
Schuell, Keene, N.H.) or to Hybond N (Amersham, Piscataway,
N.J.) using 0.4 N NaOH in a TurboBlot apparatus, according to
the manufacturer's instructions (Schleicher & Schuell, Keene,
N.H.). Nylon membranes were rinsed in 2 × SSC after transfer,
air-dried and UV cross-linked before hybridization with the
indicated probes.



Probes 

The hemA probe was generated by PCR amplification from pSE17
using primers hemAdelupl 5'-AGGCCTCTTGGGTTATGAATG-3'
and hemAdeldwnl 5'-TGACCTGGAGATTAGACATAG-3'.
This generated a 502-bp probe positioned between the
BamHI and Eco47III restriction sites in the hemA sequence (see
Fig. 1). This probe was either labeled with DIG-labeled dUTP in a
PCR reaction performed using a DIG-label PCR kit (Boehringer
Mannheim, Indianapolis, Ind.) or with (α-32P)-dCTP by random
priming using the Prime-It II kit (Stratagene, La Jolla, Calif.). The
radioactively-labeled probe was purified using a G50 midi column
(5'-3', Boulder, Col.). PCR products were purified using either
QiaQuick Spin Columns (Qiagen, Valencia, Calif.) or GenElute
columns (Supelco, Bellefonte, Pa.) and denatured prior to use.



Hybridizations 

Membranes were prehybridized at 42 °C for 6 h in 15 ml of
hybridization solution [5 × SSC, 0.1% sarkosyl, 0.02% SDS, 1%
Genius blocking agent (Boehringer Mannheim, Indianapolis, Ind.)
and 50% formamide]. Denatured probe [1 ng DIG-labeled probe/ml
hybridization solution or 1 × 10^7 cpm of (32P)-labeled probe]
was added to 10-15 ml of hybridization solution. Hybridizations
were carried out for 16 h at 42 °C. The blot was washed twice
at room temperature for 5 min each time in 2 × SSC, 0.1% SDS,
then washed twice at room temperature for 5 min each time in
0.2 × SSC, 0.1% SDS and finally washed twice at 68 °C for 15 min
each time in 0.1 × SSC, 0.1% SDS. The washed membranes were
rinsed in 2 × SSC and were then exposed to Kodak X-OMAT AR
film, followed by development using a Konica QX-70 automatic
film processor.
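
The wash series above increases in stringency as the salt concentration falls and the temperature rises. The sketch below makes the salt side of that explicit by converting each SSC strength to an approximate sodium ion concentration. It assumes the conventional recipe in which 1 × SSC is 0.15 M NaCl plus 0.015 M trisodium citrate; the text does not state the recipe, and the function name is illustrative.

```python
# Assumed conventional composition of 1 x SSC (not stated in the text):
# 0.150 M NaCl and 0.015 M trisodium citrate.
NACL_M_PER_1X = 0.150
CITRATE_M_PER_1X = 0.015


def ssc_sodium_molar(strength: float) -> float:
    """Approximate [Na+] in mol/l for a given SSC strength (e.g. 0.2 for 0.2 x SSC).

    Each trisodium citrate contributes three sodium ions.
    """
    return strength * (NACL_M_PER_1X + 3 * CITRATE_M_PER_1X)


# Wash steps from the hybridization protocol: (description, SSC strength).
WASHES = [
    ("2 x SSC, room temperature", 2.0),
    ("0.2 x SSC, room temperature", 0.2),
    ("0.1 x SSC, 68 C", 0.1),
]

if __name__ == "__main__":
    for label, strength in WASHES:
        print(f"{label}: about {ssc_sodium_molar(strength) * 1000:.1f} mM Na+")
```

Lower sodium destabilizes mismatched duplexes, so the final 0.1 × SSC wash at 68 °C is the most stringent step of the series.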



Heterologous expression of lipase 

Colonies growing in the absence of ALA on primary selection 
plates were used to inoculate 1 ml of quarter-strength MY25 me- 
dium in 24-well microtiter plates. Plates were incubated at 34 °C, 
shaking at 200 rpm, and culture broth samples were removed on 
days 4 and 7 for lipase enzyme assays. Expression was further 
evaluated using 25 ml of MY25 medium in 125-ml plastic flasks 
shaken at 200 rpm in a 34 °C room for 4 days. 



Lipase enzyme assays 

Assays were performed in microtiter plates. Samples were diluted
in 0.02% alpha olefin sulfonate and 100 mM Mops at pH 7.5, and
hydrolysis of the p-nitrophenyl butyrate substrate (1.3 mM
p-nitrophenyl butyrate in dimethyl sulfoxide, 4 mM
CaCl2·2H2O and 100 mM Mops, at pH 7.5) was followed at 405 nm.
Relative activity was calculated based on comparisons to known
purified lipase stock solutions.



Results 

Cloning of the hemA gene 

The A. oryzae hemA gene was cloned from a genomic 
library that was screened using the A. nidulans hemA 
gene (Bradshaw et al. 1993) as a probe. Sequence anal- 
ysis of several overlapping clones revealed a single 
contiguous ORF of 1911 nucleotides (Fig. 1). The cod- 
ing sequence of hemA does not contain any introns, in 
contrast to the A. nidulans hemA gene which contains 
one intron near the 5' end (Bradshaw et al. 1993). 
Southern blot analysis using low stringency hybridiza- 
tion conditions demonstrated it to be a single copy gene 
(data not shown). The 5' untranslated sequence contains 
several pyrimidine-rich and AT-rich regions similar to 
other fungal genes (Gurr et al. 1987; Unkles 1992), as 
well as a CCAAT motif at position -249 (relative to the 
initiator ATG = +1). Also, an (AC)35 repeat motif
occurs in the 3' untranslated region approximately 40
nucleotides after the stop codon. Similar repeats have been
observed in subtelomeric, intron and promoter regions
of mammalian and S. cerevisiae genes but have no
known function, although they have been implicated in 
gene amplification events (Passananti et al. 1987). 
Microsatellites such as these have also been reported 
as abundant sequences in the human, rat and tropical 
tree genomes (Condit and Hubbell 1991; Beckman and 
Weber 1992; Dib et al. 1996). 

The nucleotide sequence of hemA encodes a predicted 
protein of 636 amino acids with a molecular weight of 
68 kDa. Overall, the deduced amino acid sequence of 
the A. oryzae ALA synthase shares 81% identity with 
the A. nidulans homologue, 53% identity with the 
S. cerevisiae protein, and 37% identity with the human 
erythroid ALAS (Fig. 3). Sequences implicated in cata- 
lytic activity and pyridoxal phosphate co-factor binding 
(Ferreira and Gong 1995), characterized by an arginine-
and glycine-rich loop (Fig. 3), are conserved. The first 35
amino acids of the A. oryzae protein are rich in serine, 
threonine, lysine and arginine, consistent with a function 
as a mitochondrial localization sequence (Fig. 3). Also, 
two potential heme regulatory motifs (HRMs) with
the tri-peptide sequence CPV/F occur in the predicted
protein sequences from both the A. nidulans and
A. oryzae genes: one in the presumed mitochondrial
localization sequence and one at the beginning of the
putative mature protein (Fig. 3).
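
The reported protein length follows directly from the ORF length, and the reported molecular weight can be roughly cross-checked. The sketch below does both. The average residue mass of 110 Da is a common rule of thumb and an assumption here, not a value from the text, so the mass is only a ballpark figure; function names are illustrative.

```python
ORF_NT = 1911            # ORF length reported in the text, stop codon included
AVG_RESIDUE_DA = 110.0   # assumed rule-of-thumb average residue mass


def residues_from_orf(orf_nt: int) -> int:
    """Number of amino acids encoded by an ORF whose length includes the stop codon."""
    if orf_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf_nt // 3 - 1  # the final codon is the stop codon


def approx_mass_kda(n_residues: int, avg_da: float = AVG_RESIDUE_DA) -> float:
    """Rough protein mass in kDa from the residue count."""
    return n_residues * avg_da / 1000.0


if __name__ == "__main__":
    n = residues_from_orf(ORF_NT)
    print(f"{ORF_NT} nt -> {n} amino acids, roughly {approx_mass_kda(n):.0f} kDa")
```

The residue count matches the 636 amino acids stated above, and the rough mass estimate is in the same range as the reported 68 kDa.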

Deletion of the hemA gene 

A marked deletion allele, hemAΔ::pyrG, was constructed
(see Materials and methods and Fig. 2) and used to
transform a PyrG⁻ A. oryzae strain. Transformants were
plated onto minimal medium agar plates containing
ALA to select for pyrG prototrophs while supporting
growth of putative hemA-deleted strains. Screening of
240 primary transformants on minimal medium lacking
ALA identified 11 strains with very poor growth on
minimal medium. Two rounds of spore purification were
required to generate strains that showed absolutely no
growth on minimal medium. Southern hybridization of
genomic DNA from five of these strains showed that one
contained both a wild-type and a hemAΔ::pyrG allele,
while four contained only the hemAΔ::pyrG allele (data
not shown). Of these four, one strain (SE29-70) was
chosen for all subsequent transformations. Confirmation
that strain SE29-70 contains only the deletion allele
is shown in Fig. 4, lane 1.

Fig. 3 hemA amino acid sequence alignment of
5-aminolevulinate synthase (ALAS) amino acid sequences from
A. oryzae, A. nidulans, human erythroid (ALAS 2) and
Saccharomyces cerevisiae. Differing amino acid residues are shaded
in black. Conserved heme regulatory motifs (HRM; CPV/F),
catalytic domain (conserved arginine) and glycine-rich loop
(YGAGAGGTRNI) are boxed. An additional HRM sequence
exists in the human sequence (underlined) which is not present
in the fungal sequences. Sequences were aligned by the
Clustal method with a gap penalty of 10, using MegAlign
software (DNASTAR, Madison, Wis.)

Rescue of the lethal deletion phenotype 

Concentrations of ALA as low as 30 µM (5.0 µg/ml)
were sufficient to rescue the ALA auxotrophic phenotype
when added to minimal medium agar. Hemin
added to minimal medium in concentrations ranging
from 0.05 mg/ml to 0.2 mg/ml had no inhibitory effect
on the growth of wild-type strains, yet was unable to
support growth of any deletion strain tested. Additional
supplementation of minimal medium with 0.1 µg vitamin
B12/ml and/or 20 µg ergosterol/ml had no effect.
The hemA deletion strains could not be maintained on
rich medium such as MY25, indicating that media
containing as much as 1% yeast extract do not contain
sufficient ALA to rescue hemA auxotrophs. This suggests
that nutrient-rich conditions used in commercial
fermentations may be selective for the maintenance of
hemA-linked expression vectors.
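
The molar and mass concentrations quoted for ALA can be reconciled with a simple unit conversion, sketched below. The text does not say which form of ALA was weighed out; the molecular weights used here (167.59 g/mol for the hydrochloride salt, 131.13 g/mol for the free acid) are assumptions supplied for illustration, and the function name is hypothetical.

```python
# Assumed molecular weights (not given in the text).
ALA_HCL_G_PER_MOL = 167.59        # 5-aminolevulinic acid hydrochloride
ALA_FREE_ACID_G_PER_MOL = 131.13  # 5-aminolevulinic acid, free acid


def micromolar_to_ug_per_ml(conc_um: float, mol_weight: float) -> float:
    """Convert a concentration in umol/l to ug/ml.

    umol/l times g/mol gives ug/l; dividing by 1000 gives ug/ml.
    """
    return conc_um * mol_weight / 1000.0


if __name__ == "__main__":
    for label, mw in [("hydrochloride", ALA_HCL_G_PER_MOL),
                      ("free acid", ALA_FREE_ACID_G_PER_MOL)]:
        print(f"30 uM ALA ({label}): "
              f"{micromolar_to_ug_per_ml(30, mw):.2f} ug/ml")
```

Under these assumptions, the 5.0 µg/ml figure in the text agrees with 30 µM only if the hydrochloride salt is meant.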

Complementation of hemA deletion by transformation 
with the hemA gene 

Protoplasts of the hemA-deficient strain were
transformed with plasmid pSE17 and plated on minimal
medium lacking ALA. Transformed colonies were
apparent after 2 days at 34 °C. Transformation efficiencies
were similar to those observed using other fungal
selectable markers, with a range of 5-30 transformants/µg
DNA in independent transformations with different
protoplast preparations. Control transformations with
no DNA showed virtually no background colony
formation. Most transformants contained a single copy of
the hemA gene as shown in Fig. 4 (lane 3). However,
one-third of transformed strains showed 1-3 extra
copies of the hemA gene (Fig. 4, lanes 4, 5). All
transformants tested were ALA prototrophs, as demonstrated
by their ability to grow on minimal medium without
ALA. These data suggest a correlation between the
restoration of ALA prototrophy and the presence of at
least one copy of the wild-type hemA gene.

Fig. 4 Southern hybridization of hemAΔ::pyrG strains transformed
with wild-type hemA. BamHI-digested genomic DNA was
electrophoresed, transferred to a nylon membrane and hybridized with a
hemA probe. Lane 1 hemA deletion strain (SE29-70), lane 2 wild-type
A. oryzae, lanes 3-5 contain deletion strains transformed with plasmid
pSE17. The wild-type A. oryzae strain shows the predicted 1.6-kb
wild-type hemA band, while the hemA deletion strain shows a 1.1-kb
band predicted for the deletion allele. All transformed strains contain
both a wild-type (1.6 kb) and a deletion allele band (1.1 kb). Lanes 4
and 5 also show extra bands, suggesting that the DNA has integrated
into more than one site

The hemA gene as a selectable marker 

In order to test the usefulness of hemA as a marker gene
for fungal transformations, a plasmid was constructed
that contained hemA and a heterologous lipase gene
(pSE54; Fig. 2). The hemA gene used in this construct
contained 523 bp of upstream DNA sequence. Use of
this plasmid to transform hemA deletion strain SE29-70
yielded 8.4 transformants/µg DNA. Eighty-eight percent
(21/24) of primary transformants tested in 24-well plates
produced levels of lipase ranging over 20-330 lipase
units (LU)/ml (Fig. 5A). In shake-flask tests, the highest
level of lipase produced by a hemA/lipase transformant
was 1000 LU/ml. Previous work has demonstrated that
untransformed A. oryzae strains do not produce any
lipolytic activity (Huge-Jensen et al. 1989). In addition,
levels produced by untransformed strains in the assay
used in this study are routinely below the sensitivity
limit, which is <3 LU/ml (Michael Lamsa, personal
communication). Southern blot analysis of hemA/lipase
transformants confirmed that they contained at least one
copy of the wild-type hemA gene (data not shown). In
contrast, a transformation efficiency of 3.8
transformants/µg DNA was obtained by transformation of
strain SE29-70 with a plasmid containing the amdS gene
and the same lipase expression cassette (pBANe8).
Sixty-three percent (15/24) of these primary
transformants tested in 24-well plates produced levels of lipase
ranging over 95-2200 LU/ml (Fig. 5B). pBANe8
transformants produced up to 3700 LU/ml in shake-flasks.

Fig. 5 A,B Lipase expression by hemA and amdS transformants.
Histograms show the number of transformants from hemA
transformation (A) or amdS transformation (B) producing various levels of
lipase. Spores from isolated transformants were grown in 24-well
plates, and samples removed on day 7 were tested for lipase activity
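
The percentages quoted for lipase-positive transformants come straight from the counts given in the text. The sketch below recomputes them; the dictionary keys and function name are illustrative labels only.

```python
# Counts reported in the text: (lipase-positive transformants, transformants tested).
COUNTS = {
    "hemA (pSE54)": (21, 24),
    "amdS (pBANe8)": (15, 24),
}


def percent(positive: int, total: int) -> float:
    """Percentage of tested transformants that were positive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * positive / total


if __name__ == "__main__":
    for marker, (positive, total) in COUNTS.items():
        print(f"{marker}: {positive}/{total} = {percent(positive, total):.1f}%")
```

The exact values are 87.5% and 62.5%, which the text reports rounded up as 88% and 63%.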



Discussion 

The goal of this work was to develop a new selectable 
marker system for use in filamentous fungi that did not 
require the use of antibiotics or specific carbon or ni- 
trogen sources to maintain selection. Enzymes of the 
heme pathway were specifically targeted as possible se- 
lectable markers since reports from S. cerevisiae sug- 
gested that gene deletions in the heme biosynthetic 
pathway could be rescued by exogenous addition of 
pathway intermediates or by reintroduction of a path- 
way gene on a plasmid (Bard and Ingolia 1990). Fur- 
thermore, a selection system based on heme biosynthesis 
was presumed to link aerobic growth rate to transgene 
expression. 

Cloning of the hemA gene encoding ALAS suggests 
that the mechanisms controlling expression of this gene 
are similar to those in other eukaryotes. A CCAAT 
motif found ca. 250 bp upstream of the start codon may 
be a site for the binding of the CCAAT-box transcription
factor complex previously identified in S. cerevisiae
(Keng and Guarente 1987), mammals (Guarente and 
Bermingham-McDonogh 1992) and plants (Edwards 
et al. 1998). Evidence for a similar transcription complex 
in Aspergillus (Kato et al. 1997) suggests that the 
CCAAT-box may have a similar function. 

HRMs, such as those identified in the putative mi- 
tochondrial localization sequence, have been implicated 
in heme-mediated regulatory events at both the tran- 
scriptional and post-translational level. Heme has been 
shown to bind to related motifs in the S. cerevisiae 
HAP1 protein, resulting in dissociation of a repressor 
factor and activation of HAP1 in the presence of heme 
(Fytlovich et al. 1993; Zhang and Guarente 1994). Also, 
when present in leader sequences, HRMs have been 
shown to prevent import of ALAS proteins into mouse 
mitochondria via direct interactions with heme (Lathrop 
and Timko 1993; Zhang and Guarente 1994). This motif 
has further been postulated to bind heme in other pro- 
teins such as heme lyases (Steiner et al. 1996) and heme 
oxygenase-2 (McCoubrey et al. 1997). Interestingly, the 
yeast protein does not contain HRMs and only two 
motifs are present in Aspergillus proteins, while a third 
motif is found in mammalian sequences (Fig. 3; Ferreira 
and Gong 1995). 

Deletion of the hemA gene proved to be lethal, and
growth could only be restored by supplementing the
growth media with ALA. Rich media containing yeast
extract did not contain sufficient ALA to allow growth,
suggesting that fermentation media would be selective
for hemA-linked gene expression systems. Unlike in
S. cerevisiae, hemA deletion mutants could not be rescued by
the addition of 0.2 mg exogenous heme/ml in A. oryzae,
even when ergosterol and vitamin B12 were provided. In
S. cerevisiae, HEM1 deletion strains can be rescued by
15 µg ALA/ml or 0.05 mg hemin/ml (Arrese et al. 1983;
Volland and Urban-Grimal 1988). This difference may
be attributed to differences between S. cerevisiae and
filamentous fungal cell wall permeability or potential
heme transport mechanisms.

It is interesting to note that initial attempts to delete 
the hemA gene failed to identify any primary trans- 
formants that were completely unable to grow in the 
absence of ALA. The inability to isolate "tight" 
auxotrophs immediately from primary transformants 
probably results from the transformation of multi- 
nucleate protoplasts and the subsequent segregation of 
the nuclei during serial culturing. After spore purifica- 
tion, strains carrying the hemA deletion were com- 
pletely unable to grow in the absence of exogenous 
ALA, even when transferred directly from plates con- 
taining ALA. 

Transformation of the hemA deletion strain with a 
plasmid containing the hemA gene and a lipase expres- 
sion cassette demonstrated that hemA can be success- 
fully used as a selectable marker. A plasmid containing 
both the lipase expression cassette and the hemA gene 
as a selectable marker produced primary transformants 
at an efficiency comparable to or better than control 
transformations using the amdS selection system. The
hemA transformants were easily distinguished and a high
percentage (88%) of randomly chosen transformants
produced detectable levels of lipase, indicating that they
contained the transgene. This is in contrast to amdS
transformants, which are often difficult to choose (63%
were found to be transformed) due to the background
of non-transformed colonies that can be prevalent,
depending on the strain background. Lipase expression
levels in independent transformants were, on average,
5- to 6-fold lower than those found using amdS as a
selectable marker. These results are similar to those found
by Christensen (1994) in a study comparing the niaD and 
amdS selectable markers, where niaD transformants 
produced lipase levels over a lower range than did amdS 
transformants. The amdS marker is known to produce 
higher copy number integrants than other markers, such 
as argB or niaD (Christensen 1994). In the current study, 
Southern analysis of hemA-lipase recombinants shows
that the majority of hemA integrants are present in a 
single copy. This may be one reason for the lower levels 
of lipase observed in hemA transformants, when com- 
pared to amdS transformants. Although transformation 
with hemA can yield multiple integration events (as 
shown in Fig. 4), a single copy of hemA is apparently 
sufficient to adequately relieve the ALA auxotrophy. 
Manipulation of the hemA gene or promoter to create a 
debilitated allele may be useful in gaining recombinants 
with increased copy number, as has been demonstrated 
with the LEU2 allele in S. cerevisiae (Erhart and
Hollenberg 1983). Current efforts are focused on testing this
hypothesis in A. oryzae and extending the use of hemA
as a selectable marker in other fungal expression 
systems. 

Measuring in a quantitative, statistical sense the degree to which
structural and functional information can be "transferred" between pairs of
related protein sequences at various levels of similarity is an essential 
prerequisite for robust genome annotation. To this end, we performed 
pairwise sequence, structure and function comparisons on ~30,000 pairs 
of protein domains with known structure and function. Our domain 
pairs, which are constructed according to the SCOP fold classification, 
range in similarity from just sharing a fold, to being nearly identical. Our 
results show that traditional scores for sequence and structure similarity 
have the same basic exponential relationship as observed previously, 
with structural divergence, measured in RMS, being exponentially related 
to sequence divergence, measured in percent identity. However, as the 
scale of our survey is much larger than any previous investigations, our 
results have greater statistical weight and precision. We have been able 
to express the relationship of sequence and structure similarity using 
more "modern scores," such as Smith-Waterman alignment scores and 
probabilistic P-values for both sequence and structure comparison. These 
modern scores address some of the problems with traditional scores, 
such as determining a conserved core and correcting for length depen- 
dency; they enable us to phrase the sequence-structure relationship in 
more precise and accurate terms. We found that the basic exponential 
sequence-structure relationship is very general: the same essential 
relationship is found in the different secondary-structure classes and is 
evident in all the scoring schemes. To relate function to sequence and 
structure we assigned various levels of functional similarity to the 
domain pairs, based on a simple functional classification scheme. This 
scheme was constructed by combining and augmenting annotations in 
the enzyme and fly functional classifications and comparing subsets of 
these to the Escherichia coli and yeast classifications. We found sigmoidal 
relationships between similarity in function and sequence, with clear 
thresholds for different levels of functional conservation. For pairs of 
domains that share the same fold, precise function appears to be con- 
served down to ~40% sequence identity, whereas broad functional class 
is conserved to ~25%. Interestingly, percent identity is more effective at 
quantifying functional conservation than the more modern scores (e.g.
P-values). Results of all the pairwise comparisons and our combined
functional classification scheme for protein structures can be accessed from a
web database at http://bioinfo.mbb.yale.edu/align
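
To make the quoted thresholds concrete, the sketch below computes percent identity for a pair of pre-aligned sequences and maps it onto the approximate 40% and 25% cut-offs reported above. It is an illustration only: the convention of ignoring gapped columns is an assumption (percent identity can be defined with other denominators), the thresholds apply to domain pairs that share a fold, and the function names and labels are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes the two strings are already aligned and of equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def transfer_confidence(pid: float) -> str:
    """Map percent identity onto the approximate thresholds quoted in the abstract."""
    if pid >= 40.0:
        return "precise function"
    if pid >= 25.0:
        return "broad functional class"
    return "below both thresholds"


if __name__ == "__main__":
    a, b = "MKT-AYIAKQR", "MKTLAYLAKNR"  # toy aligned sequences, not real data
    pid = percent_identity(a, b)
    print(f"{pid:.0f}% identity -> {transfer_confidence(pid)}")
```

The thresholds are soft guides from a statistical survey rather than hard rules, so borderline values deserve additional evidence.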

Assessing Annotation Transfer for Genomics



Introduction 

The problem of genome annotation 

Perhaps the most valuable information to be
gained from a genome analysis is functional
annotation of all the gene products. Unfortunately, of
all the proteins whose sequences are known,
functions have been experimentally determined for
only a very small number (Andrade & Sander,
1997). Given the current size and accessibility of
sequence and structure data, homologs of a newly
sequenced gene's product can be identified via
database searches, and probable structure and
function assigned to the gene product (Bork et al.,
1998). This is based on the concept that sequence
similarity implies structural and functional
similarity. However, structural and functional
annotations should be transferred with caution. If a
protein is assigned an incorrect function in a
database, the error could carry over to other proteins
for which structure or function is inferred by
homology to the errant protein (Brenner, 1999; Karp,
1996, 1998a). In large databases such an error can
propagate out of control, presenting a serious
quality control issue as we move to larger genomes
from multicellular organisms.

Benchmarking fold and function recognition 

Here, we used manually curated structural and 
functional classifications as standards in analyzing 
to what degree annotations of a protein's structure 
and function can be transferred to a similar 
sequence. The knowledge gained from the study 
can be used to establish confidence levels for struc- 
ture and function prediction, improving our under- 
standing of how long it will take to annotate 
accurately an entire genome. 

Our simultaneous analysis of relationships
between sequence and structure, sequence and
function, and structure and function (Figure 1)
may provide insight into paradigms for functional
prediction other than that based on sequence
similarity alone (Enright et al., 1999).

Past results 

Sequence-structure 

The transfer of structural annotation is well
characterized. Chothia & Lesk (1986, 1987) found
that structural divergence, when expressed in
terms of the RMS separation of matching alpha
carbon atoms, was an exponential function of
sequence divergence, expressed in terms of the
fraction of residues that differed between
sequences. The reliability of structural annotation
transferred by homology, then, depends on the
sequence identity of the homologous proteins
(Chothia & Lesk, 1986). Flores et al. (1993), Russell
& Barton (1994), and Russell et al. (1997) observed
the same general trend, and also characterized the
conservation of structural features other than the
Cα backbone, such as secondary structure,
accessibility and torsion angles. A paper by Wood &
Pearson (1999) re-expressed the sequence-structure
relationship in terms of statistically based
"Z-scores" and found that this relationship had a
simple linear form in terms of these scores. They
also noted that protein families differed in detail in
the slope of this linear relationship.
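
The exponential sequence-structure relationship described above can be sketched numerically. The text gives only the functional form, so the coefficients below (0.40 Å and 1.87) are an assumption: they are the values commonly quoted for the Chothia & Lesk (1986) fit and should be checked against that paper before being relied on. The function name is illustrative.

```python
import math

# Assumed coefficients for RMSD = A * exp(B * H); not stated in the text.
A_ANGSTROM = 0.40
B_EXPONENT = 1.87


def expected_rmsd(fraction_different: float) -> float:
    """Expected core C-alpha RMS deviation (in angstroms) for a pair of
    homologous proteins, given the fraction H of residues that differ."""
    if not 0.0 <= fraction_different <= 1.0:
        raise ValueError("fraction_different must lie between 0 and 1")
    return A_ANGSTROM * math.exp(B_EXPONENT * fraction_different)


if __name__ == "__main__":
    for identity in (100, 80, 50, 20):
        h = 1.0 - identity / 100.0
        print(f"{identity}% identity (H = {h:.1f}): "
              f"about {expected_rmsd(h):.2f} A")
```

Under this form the expected deviation grows slowly at high identity and rises steeply as identity drops, which is why structural annotation transfer becomes less reliable for distant homologs.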

Others have focused on the limits of sequence
comparison, specifically around the "twilight
zone," the region of sequence similarity that does
not reliably imply structural homology (Doolittle,
1987), and on establishing cut-offs for significant
sequence similarity. Using the SCOP structural
classification (Murzin et al., 1995), Brenner et al.
(1998) benchmarked the effectiveness of the
popular FASTA and BLASTP programs and their
probabilistic scoring schemes (i.e. the e-value) (Pearson
& Lipman, 1988; Pearson, 1996; Altschul et al.,
1990, 1994; Karlin & Altschul, 1993). They found
that in making fold assignments, the FASTA
e-value closely tracked the number of false
positives, i.e. the error rate, and that at a conservative
e-value cut-off of 0.001, the FASTA program could
detect nearly all the relationships that would be
detected by a full Smith-Waterman comparison
(Smith & Waterman, 1981). Specifically, they found
that FASTA with a 0.001 threshold would find
16% more of the structural relationships in SCOP
than would be found by standard sequence
comparison with a 40% identity threshold. This
rigorous benchmarking approach has been extended to
assess transitive sequence comparison, through a
third intermediate sequence and multiple-sequence
matching programs such as PSI-blast (Park et al.,
1997, 1998; Gerstein, 1998a; Salamov et al., 1999). In
a related study Rost (1999) worked on
characterizing the region after the twilight zone, which he
called the "midnight zone". In a sense these
benchmarking studies have culminated in the CASP fold
recognition experiments (Moult et al., 1997;
Sternberg et al., 1999).



Sequence-function 

Although the exact dependence of functional 
similarity on sequence and structural similarity is 
not completely clear, initial indications of a gene 
product's function are most often based on simple 
sequence similarity (Bork et al., 1994, 1998). Often
these are merely based on the best hit in database
comparisons; see, for example, the annotation of
some of the early genomes (Fraser et al., 1995,
1998). However, possibilities for more robust annotation
transfer are increasingly available. One approach looks
at the pattern of hits amongst different phylogenetic
groups (Tatusov et al., 1997). Others
focus on the existence of key motifs and patterns
associated with function (Zhang et al., 1998; Bork &
Koonin, 1996; Attwood et al., 1999).

Figure 1. This Figure schematically depicts certain aspects of our comparison methodology. (a) The paradigm relating
sequence to structure to function. There has not been as much assessment of functional annotation transfer based
on structure as there has been with sequence-based structural and functional annotation transfer. (b) How we
conceptualized our analysis in terms of pairs. A few examples of SCOP domains (identified on the left and bottom) are
included from our comparison. In the Figure the shape represents fold, and the pattern represents function. We have
highlighted some example categories of pairs: a pair that shares fold and function, a pair that shares fold but not
function and a pair that shares neither fold nor function. The latter category of pairs is not considered in our
investigation; we looked only at paired domains with the same fold. In constructing our pairs, we used only a
representative set of SCOP domains. This is illustrated in the Figure by the domains flagged with asterisks. Note, in particular,
that the SCOP domain d4tima_ is not paired with anything because it is represented by d5tima_, which is the same
species and protein. For each level of pairs (fold, superfamily, family), cluster representatives were chosen for the
level below: (i) for family pairs, one representative was selected from each species/protein, the level below, and then
paired with all the other representatives within its family; (ii) for superfamily pairs, one representative was chosen
from each family, unless there were domains in the family that shared less than 40% sequence identity, in which case
additional representatives were included, each not more than 40% identical with the other representatives from the
family (this occurs, for instance, for the globins); and (iii) likewise for fold pairs, one representative was chosen from
each superfamily, more if there were domains with less than 40% sequence identity. (c) Subdivides the pairs into the
four SCOP classes from which they were composed: (i) all-α, domains consisting of α-helices; (ii) all-β, domains
consisting of β-sheets; (iii) α/β, domains with integrated α-helices and β-strands; and (iv) α+β, domains with segregated
α-helices and β-strands. We initially set apart the immunoglobulins from the rest of the all-β pairs because we
realized that their large number biases our data. However, we compared the results for the immunoglobulin pairs to all
other pairs and found that they generally exhibit the same behavior as the other pairs. Therefore we decided to leave
them in the comparison.



Sequence-structure-function 

One way that the better-defined sequence-struc- 
ture relationship can assist in function prediction is 
initially to predict the structure of an uncharacter- 
ized sequence and then predict the function based 
on the limited repertoire of functions known to 
occur with that structure. To some degree this was 
achieved by Fetrow and co-workers (Fetrow et al.,
1998; Fetrow & Skolnick, 1998). They predicted 
structural profiles based on threading and ab initio 
methods, and then searched with these against 
profiles of known structures in order to predict 
function. 

In related work, Russell et al. (1998) discussed
using identification of structural binding sites in 



predicting protein function. In a comprehensive 
study, Hegyi & Gerstein (1999) investigated to 
what degree folds were associated with functions. 
They found that most folds were associated with 
one or two functions with the exception of a few 
special folds, such as the TIM barrel, that could 
carry out numerous functions. Furthermore, they 
found that particular folds were often confined to 
distinct phylogenetic groups, an additional fact 
that can feed into an integrated sequence-structure- 
function analysis (Gerstein & Hegyi, 1998; 
Gerstein, 1997, 1998b,c). 

Here, we look at pairwise comparisons of 
protein sequence, structure and function among 
proteins that share the same fold. We assess the
trends relating sequence, structure and function
and consider the implications for structural and 
functional annotation transfer. 

New developments: probabilistic scoring and 
growth of the databank 

The past studies regarding sequence, structure 
and function relationships often used RMS separ- 
ation and percent sequence identity (or a linear 
variant of it, such as the fraction of mutated resi- 
dues) to express similarities in structure and in 
sequence, respectively. However, it has become 
increasingly common to use probabilistic scoring 
schemes (P-values) to express the quality of a 
match in terms of statistical significance rather 
than an arbitrary raw score such as percent iden- 
tity (Pearson, 1998; Karlin & Altschul, 1990, 1993; 
Karlin et al., 1991; Altschul et al., 1994; Bryant &
Altschul, 1995; Abagyan & Batalov, 1997). With
P-values, scores from different investigations can 
be compared in a common framework. Recently, it 
was found that sequence and structure similarity 
significance can be expressed as P-values in the 
same unified statistical framework (Levitt & 
Gerstein, 1998). Here, we use such probabilistic 
scoring methods to overcome the limitations of the 
more traditional scores. 

Another recent development is the tremendous 
growth in the number of solved structures. The 
RCSB Protein Data Bank (Bernstein et al., 1977) now
contains more than 10,000 protein structures. These 
structures are broken into more than 18,000 
domains, and then domains that share a fold are 
paired up with each other for comparison 
(Figure 1(b)). Here, we survey ~30,000 pairs of 
protein domains that are known to have the same 
fold, approximately 1000 times the number com- 
pared by Chothia & Lesk (1986). The large scale of 
this comparison affords greater statistical weight to 
the results. 

Alignment of 30,000 pairs from SCOP 

The basic unit of comparison: a pair of 
protein domains 

The protein domains that we studied were classified
by SCOP, a Structural Classification of Proteins
(Murzin et al., 1995; Brenner et al., 1996;
Hubbard et al., 1997), a hierarchy of five levels:

(i) class, domains that have the same secondary
structural content (all-α, all-β, α/β, or α+β);

(ii) fold, domains that geometrically share the same 
tertiary fold; (iii) superfamily, domains descended 
from the same ancestor (but which lack measurable 
sequence similarity); (iv) family, domains in the 
same protein sequence family (which have appreci- 
able sequence similarity); and (v) species and 
protein. 

Pairs of protein domains that are grouped 
together at the fold, superfamily or family level 
form the basic unit of our comparisons. 



Selection of pairs 

There is potentially a huge number of pairs of 
domains that can be constructed out of the 
relationships in SCOP. For instance, in the current 
version of SCOP there are ~3.9 million potential 
pairs between domains sharing the same fold. 
Most of these are between nearly identical struc- 
tures. In order to keep the number of pairs man- 
ageable, we used a straightforward clustering 
scheme, described in the legend to Figure 1. We 
selected 29,454 representative pairs from the total 
in SCOP. To achieve a wide range of similarities, 
we constructed the pairs on three levels of the 
SCOP hierarchy: (i) family pairs, 19,542 pairs of 
domains in the same family; (ii) superfamily pairs, 
4220 pairs of domains in the same superfamily 
but different families; and (iii) fold pairs, 5692 
pairs of domains in the same fold but different 
superfamilies. 
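
The three pair levels can be illustrated with a minimal sketch. It assumes that each domain carries a SCOP-style dotted identifier of the form class.fold.superfamily.family (for example "c.1.1.1"); the function name and the example identifiers are hypothetical and are not taken from the survey itself. The sketch also checks that the three pair counts quoted above add up to the stated total.

```python
from typing import Optional

# Pair counts quoted in the text (family, superfamily, fold).
PAIR_COUNTS = {"family": 19542, "superfamily": 4220, "fold": 5692}


def pair_level(sccs_a: str, sccs_b: str) -> Optional[str]:
    """Return the level at which two domains would be paired.

    Identifiers are assumed to be 'class.fold.superfamily.family'.
    Pairs that do not share a fold return None, because such pairs
    are not considered in the comparison.
    """
    a, b = sccs_a.split("."), sccs_b.split(".")
    if a[:2] != b[:2]:
        return None          # different class or fold
    if a[2] != b[2]:
        return "fold"        # same fold, different superfamilies
    if a[3] != b[3]:
        return "superfamily"  # same superfamily, different families
    return "family"          # same family


total_pairs = sum(PAIR_COUNTS.values())  # 29454, the stated total
```

A pair such as ("c.1.1.1", "c.1.2.1") would be counted among the fold pairs, whereas ("c.1.1.1", "c.1.1.3") would be a superfamily pair.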

All the selected domains were at least 50 resi- 
dues in length and were drawn from the four 
major SCOP secondary-structural classes: all-α,
all-β, α/β, and α+β (Figure 1(c)).

We automatically aligned each of our selected 
domain pairs twice, once by global Needleman- 
Wunsch sequence comparison (Needleman & 
Wunsch, 1971; Myers & Miller, 1998) and then 
by structure (Gerstein & Levitt, 1996, 1998), cal- 
culating scores for sequence and structural simi- 
larity. 
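
For readers unfamiliar with global alignment, the following is a minimal sketch of the Needleman-Wunsch dynamic-programming recurrence. It is a simplification and not the procedure used in the survey: it assumes a simple match/mismatch scheme and a linear gap penalty, whereas the actual comparison used a substitution matrix with separate gap opening and extension penalties. The function name and default parameters are illustrative assumptions.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

Because the alignment is global, every residue of both sequences is accounted for, so unmatched ends are charged as gaps rather than ignored.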

Web-accessible database

The results of all the pairwise comparisons are 
available via a searchable database on the web at 
http://bioinfo.mbb.yale.edu/align. The query
engine allows searches of individual SCOP pairs, 
all pairs that include a given SCOP domain, or all 
pairs containing any SCOP domain contained in a 
given PDB entry. 

Traditional scores: RMS and percent identity 

The sequence-structure relation, as expressed by 
the root-mean-square (RMS) of the aligned Cα distances
and percent sequence identity, has been previously
characterized as an exponential function by
Chothia & Lesk (1986) and others (Flores et al.,
1993; Russell & Barton, 1994; Russell et al., 1997).
As Figure 2 illustrates, our data display a similar 
trend. (Exact equations are given in the legend to 
Figure 2.) However, we have one thousand times 
as many data points as in Chothia and Lesk's orig- 
inal study (30,000 as opposed to 30). 
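
The exponential relation can be evaluated directly. The sketch below assumes the coefficients as read from the legend to Figure 2, namely R = 0.21 exp(0.0132 H) for the simple fit, R = 0.18 exp(0.0187 H) for the multigraph fit and R = 0.40 exp(0.0187 H) for the Chothia & Lesk data, with H = 100% minus the percent identity. The dictionary keys and function name are illustrative only.

```python
import math

# (prefactor in Angstrom, rate per percent difference), as read from
# the legend to Figure 2; treat these as approximate.
FITS = {
    "single": (0.21, 0.0132),
    "multi": (0.18, 0.0187),
    "chothia_lesk": (0.40, 0.0187),
}


def expected_rms(percent_identity: float, fit: str = "single") -> float:
    """Approximate RMS separation predicted by one of the fits."""
    prefactor, rate = FITS[fit]
    h = 100.0 - percent_identity  # percent difference between sequences
    return prefactor * math.exp(rate * h)
```

Since the multigraph fit and the Chothia & Lesk curve share the same exponent, they differ only by a constant factor (0.40/0.18), which is consistent with a difference in baseline arising from the trimming method.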

The main difference between our results and 
the previous studies is due to differences in
RMS "trimming" methods. By trimming we refer 
to the process of removing the worst-fitting 
aligned atoms from the RMS calculation, to 
arrive at a structural "core." This was first 
developed in Lesk's sieve-fit procedure (Lesk & 
Chothia, 1984) and has been refined in numerous
studies (e.g. Gerstein & Altman (1995)). This
is done because the small distances between 
well-matched alpha carbon atoms have much 
less of an effect on the RMS than do the very 
large distances between poorly matched atoms. 
The untrimmed score of divergent protein 
domains is then concerned primarily with the 
poorly matched residues instead of the con- 
served core. Trimming alleviates this effect by 
restricting the RMS calculation to include only 
those residues believed to be in the conserved 
core. However, the degree of trimming is to 
some extent arbitrary, and this choice affects the 
baseline of the reported RMS scores. Here we 
considered only the better half (50%) of matched 
residues in a given pair of protein domains. 
Chothia & Lesk (1986) chose a somewhat differ- 
ent threshold. Figure 2(c) and (d) demonstrate 
the effect of trimming. 
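
The effect of trimming can be sketched numerically. The example assumes that the distances between aligned Cα atoms are already available as a list after superposition; a full implementation would also re-fit the structures on the retained core, which is omitted here. The function names are hypothetical.

```python
import math
from typing import Sequence


def rms(distances: Sequence[float]) -> float:
    """Root-mean-square of a set of aligned-atom distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def trimmed_rms(distances: Sequence[float],
                keep_fraction: float = 0.5) -> float:
    """RMS over only the best-fitting fraction of aligned pairs."""
    n_keep = max(1, int(len(distances) * keep_fraction))
    core = sorted(distances)[:n_keep]  # smallest distances = best matches
    return rms(core)


# Four well-matched and four poorly matched residue pairs (Angstrom).
example = [0.5, 0.5, 0.5, 0.5, 8.0, 8.0, 8.0, 8.0]
untrimmed = rms(example)       # dominated by the 8 A distances
trimmed = trimmed_rms(example)  # reflects only the conserved core
```

In this toy case the untrimmed value is more than ten times the trimmed one, which illustrates why the choice of trimming threshold shifts the baseline of reported RMS scores.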



Analogous alignment similarity scores: Smith- 
Waterman score and structural 
comparison score 

The dependence of the RMS separation on trim- 
ming method restricts its usefulness in comparing 
data. Likewise, there are many problems with 
using percent identity as a measure of sequence 
similarity. For instance, a match of non-identical 
but still similar residues (e.g. Arg versus Lys) scores 
the same as one between completely different resi- 
dues (e.g. Arg versus Val), and gaps do not enter in 
the score calculation. Consequently, we now turn 
to alignment similarity scores, which eliminate 
some of the problems with traditional scores. 
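
The Arg/Lys versus Arg/Val point can be made concrete with a small sketch. For illustration it uses a handful of BLOSUM62 entries rather than the matrix used in the survey, and the short peptide strings are invented; both are assumptions made only for this example.

```python
# A few BLOSUM62 substitution scores (illustrative subset only).
SUBSET = {
    ("R", "R"): 5, ("K", "K"): 5, ("V", "V"): 4,
    ("R", "K"): 2, ("R", "V"): -3,
}


def sub_score(x: str, y: str) -> int:
    """Look up a symmetric substitution score."""
    return SUBSET[(x, y)] if (x, y) in SUBSET else SUBSET[(y, x)]


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions with identical residues (no gaps)."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_score(a: str, b: str) -> int:
    """Sum of substitution scores over an ungapped alignment."""
    return sum(sub_score(x, y) for x, y in zip(a, b))


reference = "RRRR"
conservative = "RRRK"  # Arg -> Lys, a similar residue
radical = "RRRV"       # Arg -> Val, a dissimilar residue
```

Both variants are 75% identical to the reference, yet the conservative substitution scores 17 and the radical one 12, a distinction that percent identity cannot express.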

For sequence alignments, an alignment score is 
defined as the sum of the similarity matrix values 
for the alignment, minus the total gap penalty. 
This is sometimes called the Smith-Waterman score 
(Smith & Waterman, 1981). An analogous align- 
ment score for structure is the structural compari- 
son score, described by Levitt & Gerstein (1998). 
We will refer to these two similarity scores as S_seq
and S_str, respectively. Note that they both increase
for more similar pairs, whereas RMS increases for
more divergent pairs. Specifically, S_str is the score
maximized by the structural alignment program 
we used (Gerstein & Levitt, 1998). It can be calcu- 
lated from any pair of aligned structures according 
to the function:

$$S_{\mathrm{str}} = M\left(\sum_i \frac{1}{1 + (d_i/d_0)^2} - \frac{N_{\mathrm{gap}}}{2}\right) \qquad (1)$$

M and d_0 are constants, usually set to 10 and 5 Å,
N_gap is the number of gaps in the alignment, d_i is
the distance between each aligned pair of Cα
atoms, and the sum is carried over all aligned
pairs, i.
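
A minimal sketch of this score follows. It assumes the functional form given by Levitt & Gerstein (1998), in which each aligned pair contributes 1/(1 + (d_i/d_0)^2) and each gap subtracts one half, all scaled by M; the default constants are those quoted in the text. The function name is hypothetical.

```python
from typing import Sequence


def structural_score(distances: Sequence[float],
                     n_gaps: int,
                     m: float = 10.0,
                     d0: float = 5.0) -> float:
    """Structural comparison score for one aligned pair of domains.

    distances: Calpha-Calpha distances (Angstrom) of aligned pairs.
    n_gaps:    number of gaps in the alignment.
    """
    matched = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return m * (matched - n_gaps / 2.0)
```

A perfectly superimposed pair (d_i = 0) contributes the full weight M, a pair separated by exactly d_0 contributes half of it, and very distant pairs contribute almost nothing, so no explicit trimming step is required.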



The main advantage of S_str over RMS in describing
structural similarity is that the Cα to Cα
distance, d_i, appears in the denominator of the calculation.
This means that the smallest distances,
corresponding to the best matches in the conserved
core, are most significant in determining the score.
Hence, the need for trimming is eliminated. S_str is
also advantageous because it takes gaps into
account and because of the fundamental analogy
between this score and S_seq.

Figure 3(a) displays the relationship between
structural and sequence similarity as expressed by
S_str and S_seq. Figure 3(c) and (d) show calibration
curves relating each of these scores back to
approximate RMS separation and percent identity,
respectively. Calibration curves help one get an
intuitive feel for the degree of relationship in terms
of the more traditional scores. Figure 3(b) adds a
third axis, alignment length, and demonstrates that
S_str depends greatly on this quantity. Although S_str
and S_seq are "better" scores than RMS and percent
sequence identity, the heavy dependence of both of 
these on length limits their usefulness in many 
situations. In other words, two pairs of similar 
domains with equal percent sequence identities but 
different lengths can have drastically different 
scores. 

Probabilistic scores: P-values expressing the 
significance of sequence and 
structure similarity 

Probabilistic scores can, to a great degree, over- 
come the length-dependence problems associated 
with the alignment scores. Probabilistic measures 
are advantageous because they express similarity 
not by an arbitrary "score" but by a statistical sig- 
nificance: the likelihood that such a similarity 
could be achieved by chance. This likelihood is 
also called the "P-value." We used calculations 
(described in detail in the legend to Figure 4) 
based on those given by Levitt & Gerstein (1998) to 
obtain P-values based directly on S_str and S_seq; we
refer to these calculated P-values as P_str and P_seq,
respectively. For P_seq we could equally well have
used the numbers from one of the popular
sequence search programs (i.e. BLAST or FASTA)
as all these values have been shown to be perfectly
proportional to each other (Levitt & Gerstein, 1998;
Brenner et al., 1998).
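
The P-value calculation can be sketched as follows, assuming the extreme value form P = 1 - exp(-exp(-Z)) and the constants as read from the legend to Figure 4. The legend reuses the symbols a and b for the sequence and the structure formulas with different values, so the sketch gives them distinct names; all function and constant names are illustrative.

```python
import math

# Sequence constants (legend to Figure 4).
A_SEQ, B_SEQ = 5.84, -26.3
# Structure constants (legend to Figure 4).
A_STR, B_STR = 171.8, -419.4
C_STR, D_STR, E_STR = 18.4, -4.50, 2.64
F_STR, G_STR = 21.4, -37.5


def p_value(z: float) -> float:
    """Extreme value P-value, P = 1 - exp(-exp(-z))."""
    if z < -700.0:  # exp(-z) would overflow; P is 1 to machine precision
        return 1.0
    return -math.expm1(-math.exp(-z))


def z_seq(s_seq: float, n: int, m: int) -> float:
    """Z for a sequence score; n and m are the two sequence lengths."""
    mean_len = math.sqrt(n * m)  # geometric mean M, with M^2 = nm
    return s_seq / A_SEQ - 2.0 * math.log(mean_len) - B_SEQ / A_SEQ


def z_str(s_str: float, n_matched: float) -> float:
    """Z for a structural score; n_matched is the number of matched residues."""
    ln_n = math.log(n_matched)
    if n_matched < 120:
        return ((s_str - C_STR * ln_n ** 2 - D_STR * ln_n - E_STR)
                / (F_STR * ln_n + G_STR))
    return (s_str - A_STR * ln_n - B_STR) / (F_STR * math.log(120) + G_STR)
```

Because both Z expressions subtract a length-dependent term before the P-value is taken, two alignments of different length but comparable quality end up with comparable significance, which is the length correction that the raw alignment scores lack.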

P_seq and P_str can be used to express the relationship
between structure and sequence similarity on
a more fundamental level. Figure 4(a) shows a log-log
(base 10) plot of P_str against P_seq. Because it is
log-log, trends can be visualized as straight lines. 
Two straight lines are necessary to fit the points 
well, with the discontinuous boundary between 
the lines located at the beginning of the twilight 
zone. The different slope of the line at low 
sequence similarity reveals that in the twilight 
zone there is a different relationship between the 
significance of structural similarity and that of 
sequence similarity. In particular, for domain pairs 
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Figure 2. RMS as a function of percent identity, (a) A simple scatter plot of our pairs, relating RMS separation to 
percent sequence identity. This is similar to the presentation given by Chothia & Lesk (1986), but in this survey we 
looked at 30,000 pairs, 1000 times the number they compared. Outliers (pairs with RMS scores further than two stan- 
dard deviations from the mean for their percent identity) are excluded from this graph; they represent domains that 
are very closely related with the exception of a conformational change, (b) A simplified graph with a number of fits 
to the data. For each percent identity bin we show the median RMS value, indicated by (♦) and the top and bottom 
quartile RMS values, indicated by the bars. Two fits are drawn through the median RMS values. The thin line, 
labeled SINGLE, is a simple exponential fit through the medians. It has the form: 



0.01 32H 



R = 0.21e' 

where- R is the RMS deviation after least-square fitting, H is the percent difference between the sequences (H for 
Hamming distance), and H= 100% - /, where J is the percent sequence identity. The thick line, labeled MULTI, is a 
multigraph fit, which is described in the legend to Figure 4. The relation between RMS and percent identity according 
to this fit is expressed by the equation: 

R = 0.18e 00187H 

The twilight zone of sequence identity and below is labeled TZ. In this region, sequence similarity is not significant 
and not reliable for predicting structural similarity. This is why the median values in this area of the graph deviate 
significantly from the fits, which consider only data above 20% sequence identity. For reference we include the orig- 
inal data points from Chothia and Lesk's, 1986 paper (A.M. Lesk, personal communication), indicated by X. Their 
data follow the form: 

R = 0,40e 00187H 

The difference between the Chothia & Lesk trend and our relationship is due to the different trirnming methods used 
in calculating the RMS score. Chothia and Lesk imposed a 3 A cut-off in determining the conserved core residues; we 
defined the core as the better matching (in terms of C* distances) half (50 %) of the residue pairs, (c) and (d) The 
effect our trimming has on median RMS values. The RMS values in (c) are calculated from all the matched residues 
in each pair; the values in (d) are calculated from the better matching 50 % of the residues. 



in the twilight zone (according to the percent identity
to P_seq calibration in Figure 4(b)), structural
similarity is more significant than sequence similarity
(having a smaller P-value or more negative
log P-value). In contrast, for pairs with more than
~30% identity, the situation is reversed, with a
given pair having more significant sequence simi- 
larity than structural similarity. One possible 
interpretation of this reversal is as follows. Struc- 
ture is always more highly conserved than 
sequence, so usually a given amount of structural 
similarity is not as significant as a corresponding 
amount of sequence similarity. However, this is 
true only when meaningful sequence similarity 



actually exists; thus, it does not apply in the twi- 
light zone, where sequence similarity is by defi- 
nition not significant. Note that all pairs in our 
comparison share at least the same fold, implying 
that they always have a significant amount of 
structural similarity. 

In other words, for closely related sequences, 
differences in sequence similarity are more mean- 
ingful, whereas for highly diverged sequences that 
share the same fold, the differences in structural 
similarity are more significant. 

Fitting two lines to the P_str versus P_seq graph
suggests that the same might be done for other 
scoring schemes. It is possible to some degree to fit

Figure 3. Similarity scores: structural comparison score as a function of Smith-Waterman score. Alignment similarity
scores S_str and S_seq have certain advantages over RMS and percent identity scores for expressing the sequence-structure
relation. S_str is calculated according to equation (1) in the text (Gerstein & Levitt, 1998; Levitt & Gerstein,
1998). S_seq is calculated using the BLOSUM50 matrix (Henikoff & Henikoff, 1992) with gap opening and extension
penalties of -12 and -2, respectively. (a) This is analogous to (b) in Figure 2. From the original 30,000 pairs we show
the median S_str value for each S_seq bin, along with quartile bars above and below. Again the twilight zone and below
is labeled TZ. The thin line, marked SINGLE, is a simple fit to the median S_str values in this graph; it has the form:

$$S_{\mathrm{str}} = 2144 - 1106\exp(-0.00544\,S_{\mathrm{seq}})$$

The thick fit, marked MULTI, is the multigraph fit, explained below. It follows the equation:

$$S_{\mathrm{str}} = 2157 - 787\exp(-0.0028\,S_{\mathrm{seq}})$$

The equations presented here provide an approximation of the observed trends; as (b) illustrates, they are nothing
more than simple approximations. The main disadvantage of S_str as a measure of structural similarity is its heavy
length dependency for pairs of structurally similar protein domains. (b) Surface plot of the median S_str as a function
of S_seq and alignment length (the number of matched residue pairs). It is clear that the size of the aligned domains
plays a major role in the resulting S_str, even though our fits do not take length into account. (c) and (d) Relate S_seq
and S_str to the more familiar percent identity and RMS measures. The fits were used to convert between scoring
schemes in constructing the multigraph fit. We derived the multigraph fit in order to create one set of equations and
parameters that would relate sequence and structural similarity using either the percent identity and RMS scheme or
the S_seq and S_str scheme, and allow translation between them. We simultaneously performed least-squares fits to the
median values in four graphs: Figures 2(b) and 3(a) and the calibrations of S_seq to percent identity and S_str to RMS,
(c) and (d), respectively. In all cases, we ignored data in and below the sequence identity twilight zone (labeled TZ).
The parameters in (a) are dependent on the parameters in Figure 2(b) via the mentioned calibrations.



the traditional RMS versus percent identity graph 
(Figure 2) with two straight lines instead of an 
exponential curve. However, in this case, we opted
for the more conventional presentation. 

Class differences 

The division of SCOP into classes based on sec- 
ondary-structural composition allows easy investi- 
gation as to whether there are any deviations from 
the common similarity relationships on account of 
secondary-structure characteristics. Figure 5(a) 
reveals that secondary structural composition does 
not markedly affect the trends in sequence and 
structure similarities. This is consistent with the 



data given by Wood & Pearson (1999). However, 
the larger average length of α/β domains compared
with domains in the other classes results in a
deviation in the length-dependent S_str (Figure 5(b)).
The consistency among length-independent scores 
applies for certain individual folds as well. The 
immunoglobulin fold makes up an appreciable 
fraction of all the all-β pairs (Figure 1(c)), yet the
results are not affected if these pairs are left out. 

Linking sequence and structure to function 

Difficulties of functional comparison 

There is a clear, well-characterized relationship 
between sequence and structure similarity, which

Figure 4. Probabilistic scores: P-values. P_seq and P_str are P-values calculated from S_seq and S_str according to the
formalism given by Levitt & Gerstein (1998). Both quantities have the same overall functional form in terms of an
extreme value distribution:

$$P = 1 - \exp(-\exp(-Z))$$

where P is either P_seq or P_str. For P_seq, Z = S_seq/a - 2 ln M - b/a, where a = 5.84, b = -26.3, and M is the geometric
mean of the lengths of the two sequences (i.e. M^2 = nm, where n and m are the two sequence lengths). For P_str, Z is a
function of S_str and N, the number of matched residues. For N < 120:

$$Z = \frac{S_{\mathrm{str}} - c\ln^2 N - d\ln N - e}{f\ln N + g}$$

For N ≥ 120:

$$Z = \frac{S_{\mathrm{str}} - a\ln N - b}{f\ln 120 + g}$$

At N = 120, continuity implies that:

$$a\ln 120 + b = c\ln^2 120 + d\ln 120 + e \quad\text{and}\quad a = 2c\ln 120 + d$$

This, in turn, allows the calculation of the constants:

a = 171.8, b = -419.4, c = 18.4, d = -4.50, e = 2.64, f = 21.4, g = -37.5

(a) of this Figure is analogous to Figures 3(a) and 2(b), with the exception of the fits. It is a log-log (base 10) plot
relating P_seq and P_str. We show the median log(P_str) value for each log(P_seq) bin, along with quartile bars above and
below. We have added approximate percent identity and RMS values to the x and y axes to aid interpretation of the
graph in terms of more familiar scores. The values were calculated using the calibration curves in (b) and (c). The
straight-line nature of the log-log plot reveals distinct relations inside and outside the twilight zone, labeled TZ. (The
area of percent identity below the twilight zone does not appear in P_seq graphs; there is no significance for such low
sequence similarity, and thus all data points in that zone appear at P_seq = 1 or log[P_seq] = 0.) The thick line in the figure is
fit to the median P_str values for P_seq values outside the twilight zone; its equation is:

$$P_{\mathrm{str}} = 10^{-10}\,P_{\mathrm{seq}}^{0.05}$$

The thin line is fit to the data inside the twilight zone; it follows the relation:

$$P_{\mathrm{str}} = 10^{-6}\,P_{\mathrm{seq}}^{0.74}$$

For reference we include the dotted line, representing the function P_str = P_seq, where sequence and structural
similarity are equally significant. See the text for a discussion of how the two trends might be interpreted with respect to
this line.



can be used to transfer precisely structural annota- 
tion based on the degree of sequence homology. In 
genome analysis, however, one is usually more 
interested in finding a functional annotation for an
open reading frame based on similarity to well- 



known proteins; yet the sequence-function and 
structure-function relationships have not been as 
explicitly characterized. The fundamental obstacle 
to extending this and similar investigations to deal 
with function is the absence of a clear measure of

Figure 5. SCOP class differences. Previously it has
been observed that secondary structural composition
does not cause deviations from the trends in structure
and sequence similarity (Flores et al., 1993). To test this
observation we looked at the scores divided by SCOP
class. The following legend applies to the graphs:
( — ■ — ), all alpha; (- ♦ -), all beta; (--A--), alpha/beta;
(--x--), alpha + beta. (a) Median RMS values for
each percent identity bin. The traditional scores reveal
no dependency on class. However, in (b) α/β pairs
consistently score higher S_str scores than pairs in other
classes. This is a consequence of the dependence of S_str
on length; domains in the α/β class are longer, on
average, than in the other classes.



functional similarity. Although we were able to 
present three different quantitative measures of 
structural relatedness, an analogous situation for 
function does not exist. How can one express 
quantitatively the degree of similarity between a 
triosephosphate isomerase and a glucose-6-phosphate
isomerase? How do they compare to trp
repressor? 

The absence of a clear measure of functional 
similarity is not the only obstacle in transferring 
the functional annotations between proteins with 
different degrees of homology. The definition of 
function itself is often vague. More specifically, at 
present there is an absence of such important infor- 
mation as a standardized vocabulary for protein 
functional annotations with an associated number- 
ing scheme, descriptions of monomer functions of 
subunits of multisubunit proteins and hierarchical 
functional assignments for proteins with multiple 



functions. As a consequence of these difficulties 
there is no functional equivalent to the hierarchical 
fold classification for domains in PDB. 

As signs of progress in this direction, several 
functional classifications have been developed to 
date. One is the ENZYME system developed by 
the Enzyme Commission (EC) to classify enzymes 
by reaction type (Webb, 1992). This system has the 
advantage that it is "universal," applicable to
proteins in many different organisms, and is in 
wide use. However, it also has several drawbacks. 
First of all, it does not consider catalytic reaction 
mechanisms (Riley, 1998a), often ignoring obvious 
similarities. Second, it presumes a 1:1:1 relationship 
between gene, protein and reaction, although this 
is often not the case (an enzyme can have 
two functions, or two polypeptides from two 
different genes can oligomerize to perform a single 
function). Perhaps the most significant drawback 
of the EC classification is that it applies to only 
enzymes. 

A number of more comprehensive schemes 
have been developed, which classify non- 
enzymes as well as enzymes. Most of these 
focus on individual organisms. Several such 
schemes exist, for instance, GenProtEC/EcoCyc 
for E. coli (Karp et al., 1998b; Riley & Labedan,
1996; Riley, 1998b), MIPS for yeast (Mewes et al.,
1998), Ashburner's functional classification for
Drosophila, which is connected to FLYBASE
(Ashburner & Drysdale, 1994), and EGAD for
human ESTs (Adams et al., 1995). These classifications
possess some advantages. They have
additional levels of hierarchy that help present a 
more comprehensive picture of genotype-pheno- 
type relationships. On the other hand, these 
classifications still leave much room for improve- 
ment. For example, there is no standardized 
vocabulary to allow for keyword searches 
among multiple databases and across organisms, 
and there are inconsistencies in category num- 
bering style. 

Finally, there has been some promising work 
going beyond the ENZYME and organism-focused 
classifications. There has been progress on comple- 
tely automated functional classification (des Jardins 
et al., 1997; Tamames et al., 1997), which has the
potential for putting function assignments on a
more objective basis. There are a number of databases
synthesizing the various enzyme functions
into coherent pathways and systems (e.g. KEGG
and WIT; Ogata et al., 1999; Selkov et al., 1998).
There also have been some very recent attempts to 
develop cross-species classifications of non-enzyme 
functions in the framework of the Gene Ontology 
Project (GO, geneontology.org). GO is a joint project
between FlyBase, the Saccharomyces Genome
Database and Mouse Genome Informatics, 
attempting to merge the fly, yeast and mouse 
functional classification schemes. However, a truly 
universal system for classifying all protein func- 
tions in all organisms within the same framework 
remains quite a challenge because of the
sheer diversity of organisms and distinct protein
functions. 



Our simple functional classification of SCOP 
domains: FLY + ENZYME

Given the discussed limitations, we constructed 
a simple functional classification for the SCOP 
domains included in our comparison; our classifi- 
cation is based on a merger of two of the existing 
functional annotations and a cross-referencing of 
subsets of this combination with some of the
organism-specific schemes. First, we used pairwise 
comparison to cross-reference the PDB domains 
against the Swissprot database (Bairoch & 
Apweiler, 1998), as described by Hegyi & Gerstein 
(1999). We chose to assign protein functions 
according to Swissprot because it provides more 
comprehensive functional annotations than SCOP. 

We were initially able to divide all entries into 
enzymes and non-enzymes, a division that rep- 
resents the highest level of functional difference in 
our classification scheme (Figure 6). For the 
enzyme category, we transferred EC (Webb, 1992) 
numbers to those SCOP domains with a one-to-one 
match to a Swissprot enzyme. Only one-to-one 
matching entries could be considered because 
Swissprot assigns ENZYME numbers to entire pro- 
teins, whereas SCOP is a domain-based classifi- 
cation; therefore we could be confident about the 
classification of only those domains which map to 
an entire Swissprot entry. 

In the absence of an EC-type classification for 
non-enzymes, we assigned functions to non-enzy- 
matic SCOP domains according to Ashburner's 
original classification of Drosophila protein func- 
tions. This classification is derived from a con- 
trolled vocabulary of fly terms. It is available on 
the web and loosely connected with the FLYBASE 
database (Ashburner & Drysdale, 1994). For clarity, 
we precisely describe the specific files and version 
(1.55, 1997) of the classification that we used in the 
caption to Figure 6, and we will hereafter refer to 
these data files as constituting the original FLY 
classification. 

The FLY classification is a dynamic object, chan- 
ging as more is learned about the fly and other 
organisms. This is particularly true of late with the 
imminent completion of the Drosophila genome. In 
fact, since the completion of our analysis, the FLY 
classification has been superseded by the new GO
classification (see above). 

The hierarchical structure of the FLY classifi- 
cation makes it well suited for classifying non- 
enzymatic SCOP entries in a manner comparable 
to the ENZYME assignments for the enzymes. 
Another advantage of this classification is that it is 
more compatible with the makeup of the PDB than 
the E. coli and yeast classifications, as Drosophila is 
a multi-cellular organism, and many of the known 
structures come from animals. We were able to use 
the original FLY classification as a framework to 



which we added functional categories and individ-
ual proteins. For instance, we added "Hemo-
globin" to the "Physiological processes -
Respiration" category. Another example is the
"Physiological processes - Immunity" category 
(Figure 6(b)), to which we added immune system 
proteins. Many of the additions would not be 
necessary in the context of the new cross-species 
GO system. We also modified slightly the number- 
ing scheme in the original FLY classification in 
order to assign a unique hierarchical number to 
each protein domain (Figure 6(b)). We will refer to 
our augmented FLY classification as the FLY +
scheme, and our merged scheme as the FLY +
ENZYME classification.

As discussed earlier, the universal functional 
classification of proteins is very challenging and 
may not be possible with the current level of 
knowledge about genes, proteins and genomes. 
Consequently, the FLY + ENZYME classification 
of SCOP proteins is somewhat incomplete and 
inconsistent and retains many of the limitations 
of its components (Hegyi & Gerstein, 1999; 
Riley, 1998a). It is not yet broad enough to 
include many plant, virus and bacterial proteins. 
Nevertheless, it was sufficient for our analysis, 
as we were able to classify a very large number 
of the total 30,000 pairs. 



Determining functional similarity 

Using our compound functional classification, 
we were able to assign a level of functional simi- 
larity to each domain pair. According to our 
scheme, a pair can have no functional similarity 
(an enzyme paired with a non-enzyme) or it can 
have one of three levels of similarity: 

(i) General similarity. Both domains are 
enzymes or both are non-enzymes. 

(ii) Same functional class. Both domains share 
the first component of their ENZYME or FLY + 
numbers, e.g. 1.1.1.1 alcohol dehydrogenase and 
1.3.1.1 cortisone beta-reductase (for enzymes), or 
3.3.2.1.2 calcicyclin and 3.6.3.2.1 calmodulin (for 
non-enzymes). 

(iii) Same precise function. Both domains share 
three components of their ENZYME or FLY + 
number, e.g. 1.1.1.1 alcohol dehydrogenase and 
1.1.1.3 homoserine dehydrogenase (for enzymes) 
or 1.2.9.1.1.1 Arc repressor and 1.2.9.1.1.1 C-jun 
(for non-enzymes; both are transcription factors). 
A pair that shares precise function must also, by 
definition, share functional class and general 
similarity. 
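
The three levels above can be expressed as a short routine. This is only a minimal sketch: the function name, its arguments and the default of three components for precise function are illustrative assumptions, not the procedure actually used for the analysis (the Figure 6 legend notes that the number of components needed for precise function varies among FLY + categories).

```python
def similarity_level(num_a, num_b, a_is_enzyme, b_is_enzyme,
                     precise_components=3):
    """Return 'none', 'general', 'class' or 'precise' for one domain pair.

    num_a, num_b: dotted classification numbers (ENZYME or FLY +).
    precise_components: how many leading components must match for
    precise function (hypothetical default; varies by FLY + category).
    """
    # An enzyme paired with a non-enzyme has no functional similarity.
    if a_is_enzyme != b_is_enzyme:
        return "none"

    parts_a, parts_b = num_a.split("."), num_b.split(".")

    # Both enzymes or both non-enzymes, but different first component.
    if parts_a[0] != parts_b[0]:
        return "general"

    shares_prefix = (
        len(parts_a) >= precise_components
        and parts_a[:precise_components] == parts_b[:precise_components]
    )
    return "precise" if shares_prefix else "class"


# Examples taken from the text above.
print(similarity_level("1.1.1.1", "1.3.1.1", True, True))
print(similarity_level("1.1.1.1", "1.1.1.3", True, True))
```

Because the levels are nested, a pair reported as "precise" also counts toward functional class and general similarity when percentages are tallied.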

Based on those assignments we calculated the 
percentage of total pairs at a given level of 
sequence or structural similarity possessing each 
level of functional similarity. The results appear in 
Figure 7. 
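
As an illustration of this calculation, the sketch below bins pairs by percent identity and reports, for each bin, the percentage of pairs reaching each level of functional similarity. The input format, the bin width of 10 and the function name are assumptions made for the example; they are not taken from the analysis itself.

```python
from collections import defaultdict

# Levels are nested: precise implies class, class implies general.
RANK = {"none": 0, "general": 1, "class": 2, "precise": 3}


def conservation_by_bin(pairs, bin_width=10.0):
    """pairs: iterable of (percent_identity, level) tuples.

    Returns {bin_start: {"general": pct, "class": pct, "precise": pct}},
    where each pct is the percentage of pairs in that bin that reach
    at least the named level.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for pid, level in pairs:
        # Put 100% identity into the last bin rather than a bin of its own.
        start = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        totals[start] += 1
        for name in ("general", "class", "precise"):
            if RANK[level] >= RANK[name]:
                counts[start][name] += 1
    return {
        start: {
            name: 100.0 * counts[start][name] / totals[start]
            for name in ("general", "class", "precise")
        }
        for start in sorted(totals)
    }


# Made-up pairs, for illustration only.
example = [(35, "precise"), (32, "class"), (38, "general"), (31, "none")]
print(conservation_by_bin(example))
```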

Sequence and function

The relation between sequence similarity and 
functional similarity behaves as one might expect, 
with sigmoidal curves that drop off sharply at par- 
ticular conservation thresholds, and with the three 
levels of functional similarity (precise function, 
functional class and general similarity) having pro- 
gressively lower thresholds. Figure 7(a) shows that 
precise function is not conserved below 30-40% 
sequence identity, whereas functional class is con- 
served for sequence identities as low as 20-25%. 
Below 20%, general similarity is no longer con- 
served; among pairs of approximately 7% 
sequence identity, about 40% are enzymes paired 
with non-enzymes. It is important to note that in 
all the pairs considered here, the domains share 
the same fold. Functional similarity at low percent 
identities (e.g. 7%) would be much less for all 
possible pairs of domains rather than just for those 
with the same fold. It is also important to remem- 
ber that our thresholds for functional conservation 
are statistical averages over many sequences; one 
will, of course, be able to find individual cases that 
diverge more or less rapidly. 

There are differences between the functional con- 
servation thresholds of enzymes and non-enzymes, 
with enzymes appearing to more highly conserve 
precise function than non-enzymes, but non- 
enzymes conserving functional class more highly 
than enzymes. This may reflect that in our classifi- 
cation, the non-enzyme functional classes are 
broader and hence easier to conserve than those of 
the enzymes, while the non-enzymatic precise 
functions are more specific. 

When P seq is used as the measure of sequence
similarity (Figure 7(b)) the results look somewhat
different; it appears that functional class is con-
served for the entire range of sequence similarities.
In this case, percent identity is actually more discri-
minating than P seq because functional class
diverges only at sequence similarities that are low
enough that they have little or no statistical signifi-
cance, i.e. for P seq the divergence is compressed
near the vertical axis of the graph.

Structure and function 

The relation between similarity in structure and 
function is somewhat less straightforward than 
that between similarity in sequence and function. 
Figure 7(c) shows the relationship between RMS 
and functional similarity. Broadly, it appears simi- 
lar to that for percent identity and functional simi- 
larity; however, the thresholds for conservation of 
the various types of functional similarity are less 
sharp. 

RMS is more revealing with respect to functional 
similarity than the non-traditional structural scores, 
S str and P str. (Data for S str and P str are not shown
but are available from the website.) The reason is
that, while very structurally similar pairs all have
RMS scores clustered between 0 and 0.5 Å, S str has
a large range of scores for similar pairs due to the
length dependency, and P str does not have any 
limit for maximum similarity. The wide range of 
possible S str and P str scores for similar structures 
tends to blur the broad sigmoid curves so much so 
that they are no longer apparent. 

Alternative functional classifications: MIPS 
and GenProtEC 

To get some perspective on the degree to which 
our results reflected the particularities of our com- 
bined FLY + ENZYME classification, we decided 
to try the same comparisons based on the well- 
known functional classifications for yeast and 
E. coli, MIPS and GenProtEC (Mewes et al, 1998; 
Riley & Labedan, 1996; Riley, 1998b). These classi- 
fications have the advantage that they integrate 
enzyme and non-enzyme functions from the start 
and are widely used. However, as they are only 
applicable to individual organisms, we could only 
use them to classify a considerably smaller subset 
of the known structures than the compound FLY +
ENZYME system.

The specific way we used the MIPS and Gen- 
ProtEC classifications to assign function to struc- 
tures and to calculate functional similarities is 
described in the legend to Figure 7. Our results 
in terms of functional conservation (precise and 
class) at various levels of percent identity are 
shown in Figure 7(d). We observe the same gen-
eral relationships as we did for our FLY +
ENZYME scheme. That is, the functional
conservation curves have a sigmoidal shape and 
have cut-offs for precise functional similarity 
after 40% and for functional class similarity at 
lower values. However, because the MIPS and 
GenProtEC classifications are restricted to indi- 
vidual organisms, each curve represents con- 
siderably fewer data points than do the curves 
based on the FLY + ENZYME scheme; this 
required us to "bin" the MIPS and GenProtEC 
curves in a somewhat coarser fashion. 



Discussion and Conclusion 

Here, we assessed the transfer of functional and 
structural annotation by analyzing the relation- 
ships between similarity in sequence, structure and 
function. The ~30,000 protein domain pairs of 
varying levels of similarity (at least the same fold) 
that we constructed out of the SCOP classification 
show quantitative sequence-structure relationships 
consistent with previous research. The exponential 
relationship is consistent across the secondary- 
structural classes and holds for newer probabilistic 
scoring methods. 

The sequence-function and structure-function 
relationships have not been studied as precisely 
due to the lack of a robust functional classification 
and measure of functional similarity. To overcome

Figure 6. Functional classification of enzymes and non-enzymes. (a) Divides the pairs by general function. There
are three categories of pairs: (i) enzymes paired with non-enzymes (no general functional similarity), labeled
ENZ/~ENZ; (ii) enzymes paired with enzymes (same general function), labeled ENZ/ENZ; and (iii) non-enzymes paired
with non-enzymes (same general function). Pairs for which one or both domains could not be identified as enzyme 
or non-enzyme are not included in this chart. Enzymes are classified according to the EC system (Webb, 1992). The 
first component of the number represents the nature of reaction and is called class. There are six classes: oxidoreduc- 
tases, transferases, hydrolases, lyases, isomerases and ligases. The next level is subclass. It refers to the chemical 
groups on which the enzyme acts. For example, the first class, oxidoreductases, has 19 subclasses that are arranged 
according to the donor group that undergoes oxidation (CH-OH, aldehyde or oxo group, CH-CH group, etc). For 
another group of enzymes (hydrolases) subclass is determined by the nature of the bond: ester bond, peptide bond, 
etc. The next level is sub-subclass. For oxidoreductases this indicates the acceptor group: NAD(+) and NADP(+), or
cytochrome; for hydrolases the sub-subclass represents the nature of substrate (carboxylic ester hydrolases, thiolester 
hydrolases, etc.). The fourth level represents a unique number for each individual enzyme, for example, 1.1.1.1: alco-
hol dehydrogenase. (b) Shows how we adapted the functional classification of Drosophila gene products developed
by M. Ashburner. This classification is loosely connected with FLYBASE (Ashburner & Drysdale, 1994). We used ver- 
sion 1.55 (4 August 1997) that was available from Ashburner's website: 

http://www.ebi.ac.uk/~ashburn
The specific files that we used were taken from the ftp directory: 



ftp.ebi.ac.uk/databases/edgp/misc/ashburner

this we constructed our own classification by mer-
ging and extending the ENZYME and FLY
schemes and assigning levels of functional simi- 
larity. Our measures of functional similarity pro- 
vide curves relating function to sequence and 
structure; when relating functional conservation to 
sequence divergence, we find distinct thresholds at 
~40% for precise function and ~25% for func- 
tional class. 

One of the interesting results that emerges from 
this is that percent identity is more useful for quan- 
tifying functional divergence than the newer prob- 
abilistic scores. In general, modern probabilistic 
scores, such as P seq, are better at discriminating
amongst highly diverged sequences (near the twi- 
light zone) than percent identity, since they better 
take into account gaps and conservative substi- 
tutions (of similar amino acids). However, for very 
similar pairs of sequences, percent identity is a 
simpler and more direct measure of divergence 
(essentially a Hamming distance). Since divergence 
in precise function takes place before that in struc- 
ture (well before the twilight zone), it is quite 
reasonable that percent identity is more successful 
at measuring the former than the latter and that
the converse is true for the probabilistic scores. In
other words, percent identity is better calibrated
for discriminating amongst very close, significant
relationships and P seq for more distant ones.
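
To make the Hamming-distance view of percent identity concrete, the following minimal sketch computes it for two sequences that have already been aligned. Conventions for the denominator differ between programs; here columns containing a gap are simply excluded, which is an assumption of this example rather than the definition used in the analysis.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the gap-free columns of a pairwise alignment.

    aligned_a, aligned_b: equal-length aligned sequences, with `gap`
    marking insertions or deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # Keep only columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0

    # The Hamming distance is the number of mismatching columns;
    # percent identity is the complementary fraction.
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


print(percent_identity("ACDEFG", "ACDQFG"))
```

Note that a conservative substitution counts as a full mismatch here, which is exactly the insensitivity that the probabilistic scores are designed to address.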



Practical implications 

The sequence-structure and sequence-function 
relationships described here provide practical 
information for genome annotation in terms of 
folds and functions. Table 1 summarizes the rela- 
tive advantages of the different scoring methods 
we used. Using the trends in sequence and struc- 
ture similarity, one can assess the degree to which 
structural annotation can be transferred between 
sequences at a given level of sequence similarity. 
The sequence and function similarity thresholds 
potentially establish minimum requirements of 
sequence similarity for reliable function prediction. 
Note that because the protein domain pairs con- 
sidered here all share the same fold, the numbers 
for all possible pairs will differ in the region of 
very little sequence identity, in which the sequence 
similarity is not enough to indicate the same fold. 



We refer to these as constituting the original FLY classification. Recently, the FLY classification has been superseded
by the GO (Gene Ontology) Project classification, which merges fly, mouse and yeast annotation. Files related to the
GO classification are available from www.geneontology.org. In the original FLY classification all members of the high-
est level are labeled 0, representatives of the next level are labeled 1, and all lower levels are labeled 2 through to 9.
We changed the numbering scheme so that it will reflect the hierarchical nature of the classification. This 
Figure illustrates sections of the original and modified classification. The top level in the FLY classification scheme is 
called "Function primitive" (level 0) and includes six classes: "Metabolism," "Intracellular protein traffic," "Cell
structure," "Developmental process," "Physiological process," and "Behavior." The next level after "Function primi- 
tive" is "Process" or "Molecule" (level 1 in Ashburner's classification). For "Function primitive - Metabolism" the 
processes are "Carbohydrate metabolism," "Nucleotides and nucleic acids metabolism," etc. For "Function primitive 
- Cell Structure" the "Process" can be "Nucleus," "Mitochondrion," "Membrane," etc. The next level is "Pathway" 
or "Macromolecule" (level 2 in the original classification). "Pathway" can include "Metabolic pathway," "Signaling 
pathway," or "Developmental pathway." The "Macromolecule" category includes "Protein" and "Nucleic Acid". We 
added categories to the original classification in order to classify some mammalian proteins that are widely rep- 
resented in SCOP but are absent from the original FLY scheme. These categories include immune system proteins 
(labeled "new" in (b)) and respiratory proteins such as hemoglobin and myoglobin that we added to "Function primi-
tive - Physiological process - Respiration". We call our adaptation of the original FLY scheme FLY +. Further infor-
mation on this adaptation is available at:

http://bioinfo.mbb.yale.edu/align/func

(c) The overall hierarchy of our final scheme and identification of the different levels of similarity. If two proteins are 
both enzymes or both non-enzymes, then they possess general functional similarity. If they share the first component 
of their classification numbers, then they are in the same functional class. If they share the first three components of 
their enzyme numbers (or the equivalent for non-enzyme numbers, depending on category) then they have the same 
precise function. A significant difference between the two main branches of the hierarchy is that the levels of the 
ENZYME classification do not correspond exactly to those in the FLY+ system because the fly classification is more 
extensive than the enzyme classification. For instance, the FLY classification takes into account aspects of cellular 
(cytoskeleton, metabolic pathways, etc.) and phenotypic function (morphology, physiology, behavior) that are absent 
from the ENZYME scheme. This makes our classification of SCOP proteins somewhat unbalanced, as non-enzymes 
have much broader and more loosely defined functional classes. As a consequence, while each enzyme is assigned a 
four-component number, the length of a non-enzyme number varies, depending on the functional category to which 
it belongs. For example, myosin is assigned a number that happens to have the same length as EC numbers: 3.12.1.1.
However, transcription factors are numbered 1.12.9.1.1.1. We took into account this varying hierarchy depth in decid- 
ing how many components are necessary to identify precise function in each category. Note that what we mean by 
domains having the same precise function is not the same as the domains coming from the same essential protein.

Figure 7. Linking sequence, structure and function. We express functional similarity as the fractional percentage of
pairs at a given level of sequence /structural similarity for which the paired domains share a precise function, func- 
tional class, or general similarity (according to our classification, see Figure 6). The following legend applies to (a) 
through (c): ( — O — ), general similarity; ( — x — ), non-enzymes with same functional class; ( — A — ), enzymes
with same functional class; (--- x ---), non-enzymes with same precise function; and (--- A ---), enzymes with the
same precise function. (a) Relates functional similarity to sequence similarity in terms of percent identity. The func-
tional similarity appears as a sharp sigmoid, with distinct thresholds of divergence for precise function, functional
class, and general similarity. Enzymes are paired with non-enzymes only at very low percent identity, in and below 
the twilight zone (labeled TZ). At slightly higher sequence identity, pairs diverge with respect to functional class, and
beyond 40% identity with respect to precise function. Note that 50-100% identity is not shown because almost all
domains that are that similar share function with their counterparts. (b) Shows the same data using P seq as the
measure of sequence similarity. Only the divergence in precise function is visible because there is so little signifi-
cance for the low sequence similarity at which functional class and general similarity diverge; all data points in that
region appear near P seq = 1 or log(P seq) = 0 (the y-axis). (c) Illustrates that the structure-function relation is not as
clearly defined as that for sequence and function. Functional similarity expressed in terms of RMS separation appears
as a broad sigmoid curve; there are thresholds of divergence for precise function, but the divergences in functional 
class and general similarity are more gradual. The thresholds are apparent only because RMS clusters the most struc-
turally similar pairs between scores of 0 and 0.5 Å. For this reason, RMS is better at discerning functional similarity
than S str and P str, which do not cluster the most similar pairs around a set limit. (d) Shows the same relationships
(functional conservation versus percent identity) as in (a), except that for this graph functional similarity is determined
in terms of the MIPS (Mewes et al, 1998) and GenProtEC (Riley, 1998b) classifications rather than the FLY +
ENZYME scheme. The legend appears as the inset on the graph. We assigned MIPS and GenProtEC classifications
to SCOP domains based on sequence comparisons to classified yeast and E. coli open reading frames (ORFs), respect-
ively. The SCOP domain most closely matching each ORF classified in MIPS or GenProtEC was assigned the corre-
sponding MIPS or GenProtEC function number. Only matches of 80% sequence identity or greater were considered.
We used this SCOP domain as a functional representative; when determining functional similarity, we assigned to
SCOP domains with no MIPS or GenProtEC functional designation the function of the closest representative with at 
least 85% sequence identity, if one existed. GenProtEC functional identifiers are three-component numbers. We con- 
sider a pair of domains sharing the first component of their functional designation to be in the same functional class. 
Domains that share all three components are said to have the same precise function. For MIPS the functional desig- 
nation is not as straightforward, as one ORF can be assigned multiple functions. Therefore we consider domains 
which have at least one function in common to share functional class. Domains with all functions in common, the 
same combination of identifiers, share precise function. Because MIPS and GenProtEC each classify the proteins of a 
single organism, yeast and E. coli, respectively, these classifications can determine the functional similarities of only a 
small fraction of all our SCOP domain pairs. The data based on these classifications, appearing in (d), are therefore
very sparse compared to the data in (a)-(c). Despite the coarseness of the data, functional similarity based on the 
MIPS and GenProtEC classifications follows the same general relation to sequence similarity as does functional simi-
larity based on the more comprehensive FLY + ENZYME scheme. Vertical line indicates an approximate threshold of
functional divergence at 40% identity. 
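
The MIPS rule described in this legend, where one ORF can carry several functions, amounts to a comparison of sets. The sketch below is illustrative only: the function name, the return labels and the example identifiers are hypothetical and are not taken from the MIPS catalogue.

```python
def mips_similarity(functions_a, functions_b):
    """Compare two domains by their sets of MIPS-style function identifiers.

    Returns "precise" when both domains carry exactly the same combination
    of identifiers, "class" when they share at least one, "none" when they
    share nothing, and "unclassified" when either set is empty.
    """
    a, b = frozenset(functions_a), frozenset(functions_b)
    if not a or not b:
        return "unclassified"
    if a == b:
        return "precise"
    if a & b:
        return "class"
    return "none"


# Hypothetical identifiers, for illustration only.
print(mips_similarity({"01.05", "02.01"}, {"02.01", "01.05"}))
print(mips_similarity({"01.05", "02.01"}, {"02.01"}))
```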
Table 1. Summary of scoring methods

Traditional scores
  Sequence similarity: Percent sequence identity
  Structural similarity: RMS Cα separation
  Features: Well understood, in use; percent identity better for looking at functional similarity
  Limitations: RMS depends most highly on worst matches, requiring arbitrary trimming; percent identity is insensitive to gaps and conservative substitutions

Alignment similarity scores
  Sequence similarity: S seq
  Structural similarity: S str
  Features: Analogous similarity scores, S str depends most highly on best matches
  Limitations: Dependence on alignment length

Modern probabilistic scores
  Sequence similarity: P seq
  Structural similarity: P str
  Features: Statistical significance, unified framework for different comparisons
  Limitations: Not as familiar as RMS and percent identity

The Table lists the schemes presented here for characterizing the sequence-structure relationship, along with their relative advan- 
tages and disadvantages. 



Practically, then, when one searches an unchar- 
acterized open reading frame against known struc- 
tures, if the open reading frame matches a 
structure with a good e-value or percent identity, 
then the curves presented here can be used to 
check how the functional and detailed structure 
annotation will transfer. For example, if an 
unknown open reading frame matches a PDB 
structure with an e-value of 0.001 and a percent 
identity of 30%, then one can be assured that it 
has the same fold (Brenner et al, 1998) and accord- 
ing to our analysis it has a two-thirds chance of 
having the same exact function. Furthermore, it 
has a ~99% chance of having the same functional
class and its structure probably diverges from the
known structure by a trimmed RMS of less than
0.7 Å.
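
A crude way to apply the approximate thresholds quoted in this paper (about 40% identity for precise function and about 25% for functional class) is sketched below. The hard cut-offs, the function name and the return labels are simplifying assumptions: the thresholds are statistical averages read from sigmoidal curves, so real use would consult the curves themselves. For instance, at 30% identity this sketch conservatively reports only functional class, although the analysis gives roughly a two-thirds chance of the same exact function.

```python
# Approximate thresholds taken from the text; treated here as hard cut-offs,
# which is a simplification of the sigmoidal curves in Figure 7.
PRECISE_FUNCTION_THRESHOLD = 40.0
FUNCTIONAL_CLASS_THRESHOLD = 25.0


def transferable_annotation(percent_identity):
    """Most specific annotation level that may plausibly be transferred
    from a matched structure of the same fold (hypothetical helper)."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity >= PRECISE_FUNCTION_THRESHOLD:
        return "precise function"
    if percent_identity >= FUNCTIONAL_CLASS_THRESHOLD:
        return "functional class"
    return "no reliable functional transfer"


print(transferable_annotation(30.0))
```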

Future directions 

There are a number of directions in which we 
might extend this analysis. With respect to the 
sequence-structure relation, we can reduce the 
overrepresentation of the immunoglobulins and 
improve the calculation of P str (by redoing the fit
to the extreme value distribution reported by
Levitt & Gerstein (1998)) to eliminate residual
length-dependency.

In the functional realm, we can investigate if and 
how the sequence-function and structure-function 
relationships vary for different categories of pro- 
teins. For example, although we found consistency 
of the sequence-structure relationship among sec- 
ondary structural classes, Hegyi & Gerstein (1999) 
found that the distribution of enzymes and non- 
enzymes varies with secondary structural class. 
A related issue is that of conformational changes. 
It is conceivable that among domains with very 
similar sequences but structures that differ by a 
conformational change, function is less conserved 
than it is among similar sequences with more simi- 
lar structures. 

Perhaps the most important direction in which 
to further this work is the augmentation of the 
functional classification. With the growing 



number of fully sequenced genomes there is a
need for the development of a comprehensive 
system for functionally classifying proteins, a 
complete classification for the entire universe of 
protein functions. It will be a difficult process, 
as many existing organism-specific classifications 
will have to be merged, but the end result will 
have the advantage of not being biased towards 
any one organism. Such a universal classification 
will allow much more reliable transfer of func-
tional annotation.

ABSTRACT A related DNA fragment distinct from the
epidermal growth factor receptor and ERBB2 genes was de-
tected by reduced stringency hybridization of v-erbB to normal
genomic human DNA. Characterization of the cloned DNA
fragment mapped the region of v-erbB homology to three exons
with closest identity of 64% and 67% to a contiguous region
within the tyrosine kinase domains of the epidermal growth
factor receptor and ERBB2 proteins, respectively. cDNA clon-
ing revealed a predicted 148-kDa transmembrane polypeptide
with structural features identifying it as a member of the ERBB
gene family, prompting us to designate the gene as ERBB3. It
was mapped to human chromosome 12q13 and was shown to be
expressed as a 6.2-kilobase transcript in a variety of normal
tissues of epithelial origin. Markedly elevated ERBB3 mRNA
levels were demonstrated in certain human mammary tumor
cell lines. These findings suggest that increased ERBB3 expres-
sion, as in the case of epidermal growth factor receptor and
ERBB2, may play a role in some human malignancies.

Protooncogenes encoding growth factor receptors constitute 
several distinct families with close overall structural homol- 
ogy. The highest degree of homology is observed in their 
catalytic domains, essential for the intrinsic tyrosine kinase 
activity of these proteins (1). Examples of such families 
include genes encoding epidermal growth factor receptor 
(EGF-R) and ERBB2, the colony-stimulating factor 1/
platelet-derived growth factor receptors, insulin/insulin-like
growth factor 1 receptors, and EPH/ELK (2-12). Growth
factor receptors in several of these families play critical roles 
in regulation of normal growth and development. Some of 
these molecules have been implicated in the neoplastic pro- 
cess as well. In particular, both the EGF-R gene and ERBB2 
have been shown to be activated as oncogenes by mecha-
nisms involving overexpression or mutations that constitu-
tively activate the catalytic activity of their encoded proteins
(13-16). Thus, we undertook the present studies in an effort 
to identify and isolate additional members of the ERBB 
protooncogene family. 

MATERIALS AND METHODS 

Human Cells. Mammary epithelial cells AB589 (17) and 
immortalized keratinocytes RHEK (18) were provided by M. 
Stampfer (Lawrence Berkeley Laboratory) and J. Rhim, 
respectively. Normal human epidermal melanocytes 
(NHEM) and keratinocytes (NHEK) were obtained from
Clonetics (San Diego, CA). Sources for human embryo
fibroblasts (19) or mammary tumor cell lines (20) have been 
described.

DNA and RNA Hybridization. High-stringency hybridiza-
tion was conducted as described (20). Reduced-stringency
hybridization of DNA was carried out in 30% (vol/vol)
formamide followed by washes in 0.6x SSC, whereas inter-
mediate stringency was achieved by hybridization in 40%
(vol/vol) formamide and washing in 0.25x SSC.

Molecular Cloning. An oligo(dT)-primed human placenta 
cDNA library was obtained from Clontech. The oligo(dT)-
primed MCF-7 cDNA library was constructed in λpCEV9
(21). After plaque purification, phage DNA inserts were 
subcloned into pUC-based plasmid vectors for further char- 
acterization. 

Nucleotide and Amino Acid Sequence Analysis. The nucle-
otide sequence was determined for both DNA strands by the
dideoxy chain-termination method (22) using supercoiled
plasmid DNA as template.* Amino acid sequence compari-
son was performed with the alignment program by Pearson
and Lipman (23). Hydrophobic and hydrophilic regions in the
predicted protein were identified according to Kyte and 
Doolittle (24). 

RESULTS 

Identification of a Third Member of the ERBB Protoonco-
gene Family. In an effort to detect novel ERBB-related genes,
human genomic DNA was cleaved with a variety of restric-
tion endonucleases and subjected to Southern blot analysis
with v-erbB as probe. Under reduced stringency hybridiza-
tion, four Sac I restriction fragments were detected. Two
were identified as EGF-R gene fragments by their amplifi-
cation in MDA-MB468 cells (Fig. 1A, lanes 1 and 2) known
to contain EGF-R gene amplification and one as an ERBB2-
specific gene fragment due to its increased signal intensity in
ERBB2-amplified SK-BR-3 cells (Fig. 1A, lanes 1 and 3).
However, a single 9-kbp Sac I fragment exhibited equal
signal intensities in normal human thymus, A431, and SK-
BR-3 DNA (Fig. 1A). When the hybridization stringency was
raised by 7°C, this fragment did not hybridize, whereas
EGF-R and ERBB2-specific restriction fragments were still
detected with v-erbB as a probe (Fig. 1B). Taken together,
these findings suggested the specific detection of another
v-erbB-related DNA sequence within the 9-kbp Sac I frag-
ment.

For further characterization we prepared a normal human 
genomic library from Sac I-cleaved thymus DNA enriched 
for 8- to 12-kbp fragments. Ten recombinant clones detected 
by v-erbB under reduced stringency conditions did not hy- 
bridize with human EGF-R or ERBB2 cDNA probes at high 
stringency. As shown in the restriction map of a repre-
sentative clone with a 9-kbp insert, the region of v-erbB
homology was localized by hybridization analysis to a 1.5-
kbp segment spanning from the EcoRI to the downstream Pst
I site. Nucleotide sequence analysis revealed that this region
contained three open reading frames bordered by splice
junction consensus sequences (Fig. 2). The predicted amino
acid sequence of these three open reading frames revealed
the highest identity scores of 64-67% to three regions that are
continuous in the tyrosine kinase domains of v-erbB, as well
as human EGF-R and ERBB2 proteins. Furthermore, all
splice junctions of the three characterized exons in the gene
were conserved with ERBB2. Amino acid sequence homol-
ogy to other known tyrosine kinases was significantly lower,
ranging from 39-46%.

Abbreviation: EGF, epidermal growth factor.
*The ERBB3 nucleotide sequence has been deposited in the GenBank
data base (accession no. M29366).

A single 6.2-kb specific mRNA was identified by Northern 
(RNA) blot analysis of human epithelial cells by using the 
150-base pair (bp) Spe I-Acc I exon-containing fragment as 
probe (Fig. 2). Under the stringent hybridization conditions 
used, this probe detected neither the 5-kb ERBB2 mRNA nor 
the 6- and 10-kb EGF-R mRNAs (data not shown). All of 
these findings suggested that we had identified an additional 
functional member of the ERBB protooncogene family, 
which we tentatively designated as ERBB3. 

Close Structural Similarity of the Predicted ERBB3 Protein 
with Other ERBB Family Members. In an effort to charac- 
terize the entire ERBB3 coding sequence, overlapping cDNA 
clones were isolated from oligo(dT)-primed cDNA libraries 
from sources with known ERBB3 expression, utilizing gene- 
specific genomic exons or cDNA fragments as probes. The 
clones were initially characterized by restriction analysis and 
hybridization to the mRNA and were subsequently subjected 
to nucleotide sequence analysis. The clones pE3-8, pE3-9, 
pE3-11, and pE3-16 contained identical 3' ends terminating in 
a poly(A) stretch (Fig. 2). 

Fig. 1. Detection of v-erbB-related gene fragments in normal 
human thymus (lane 1), MDA-MB468 (lane 2), and SK-BR-3 (lane 3). 
DNAs were restricted with Sac I and hybridized with a 
v-erbB-specific probe spanning from the upstream BamHI to the EcoRI 
site in avian erythroblastosis proviral DNA (25). Hybridization was 
conducted at reduced (A) or intermediate (B) stringency conditions. 
The arrow denotes a 9-kbp ERBB-related restriction fragment dis- 
tinct from those of EGF-R and ERBB2. 

Fig. 2. Genomic and cDNA cloning of ERBB3. The region of 
v-erbB homology within the genomic 9-kbp Sac I insert of λE3-1 was 
subcloned into pUC (pE3-1) and subjected to nucleotide sequence 
analysis. The three predicted exons are depicted as solid boxes. 
ERBB3 cDNA clones were isolated from normal human placenta 
(shaded bars) and MCF-7 (open bar) oligo(dT)-primed libraries. The 
entire nucleotide sequence was determined for both strands on 
ERBB3 cDNA from normal human placenta and upstream of the 5' 
Xho I site on pE3-16. The coding sequence is shown as a solid bar, 
and splice junctions of the three characterized genomic exons are 
indicated by vertical white lines. Solid lines in the cDNA map 
represent untranslated sequences. Restriction sites: A, Acc I; Av, 
Ava I; B, BamHI; Bg, Bgl II; E, EcoRI; H, HindIII; K, Kpn I; M, 
Mst II; P, Pst I; S, Sac I; Sm, Sma I; and Sp, Spe I. 

The complete coding sequence of ERBB3 was contained 
within a single long open reading frame of 4080 nucleotides 
extending from position 46 to an in-frame termination codon 
at position 4126. The most upstream ATG codon at position 
100 was the likely initiation codon, as it was preceded by an 
in-frame stop codon at nucleotide position 43 and fulfilled 
Kozak's criteria (26) for an authentic initiation codon. The 
open reading frame comprised 1342 codons, predicting a 
148-kDa polypeptide. As shown in Fig. 3, the deduced amino 
acid sequence of the ERBB3 polypeptide predicted a trans- 
membrane receptor tyrosine kinase most closely related to 
EGF-R and ERBB2. A hydrophobic signal sequence of 
ERBB3 was predicted to comprise the 19 amino-terminal 
amino acid residues. Cleavage of this signal sequence be- 
tween Gly-19 and Ser-20 would generate a processed poly- 
peptide of 1323 amino acids with an estimated molecular 
mass of 145 kDa. A single hydrophobic membrane-spanning 
domain encompassing 21 amino acids was identified (Fig. 3). 
The putative ERBB3 ligand-binding domain was 43% and 
45% identical in amino acid residues with the predicted 
ERBB2 and EGF-R protein, respectively. Within the extra- 
cellular domain, all 50 cysteine residues of the processed 
ERBB3 polypeptide were conserved and similarly spaced 
when compared with the EGF-R and ERBB2. Forty-seven 
cysteine residues were organized in two clusters containing 
22 and 25 cysteines, respectively, a structural hallmark of this 
tyrosine kinase receptor subfamily (2-4). Ten potential N- 
linked glycosylation sites were localized within the ERBB3 
extracellular domain. In comparison with the EGF-R and 
ERBB2 proteins, five and two of these glycosylation sites 
were conserved, respectively. Among these, the site proxi- 
mal to the transmembrane domain was conserved among all 
three proteins (Fig. 3). 
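
The codon counts quoted in this paragraph follow from position arithmetic on the cDNA numbering. The sketch below is a minimal check, not part of the original report. It assumes that each quoted position marks the first nucleotide of its codon, which the text does not state, and the helper name `codons_between` is hypothetical.

```python
def codons_between(first_codon_pos, stop_codon_pos):
    """Count codons from the first base of a codon up to, but not
    including, an in-frame stop codon that begins at stop_codon_pos."""
    span = stop_codon_pos - first_codon_pos
    if span <= 0 or span % 3 != 0:
        raise ValueError("positions are not in the same reading frame")
    return span // 3


# Positions as quoted in the text (assumed to be first bases of codons).
ORF_START = 46           # start of the long open reading frame
ATG_POS = 100            # most upstream ATG, proposed initiation codon
STOP_POS = 4126          # in-frame termination codon
SIGNAL_PEPTIDE_LEN = 19  # predicted signal sequence, residues 1-19

orf_nucleotides = STOP_POS - ORF_START               # length of the ORF
coding_codons = codons_between(ATG_POS, STOP_POS)    # codons from the ATG
mature_residues = coding_codons - SIGNAL_PEPTIDE_LEN

print(orf_nucleotides, coding_codons, mature_residues)
```

Under that assumption the three numbers agree with the 4080-nucleotide reading frame, the 1342 codons, and the 1323-residue processed polypeptide given in the text, and the ATG at position 100 lies in the same frame as position 46.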



Biochemistry: Kraus et al. 
Proc. Natl. Acad. Sci. USA 86 (1989) 9195 



Within the cytoplasmic domain, a core of 277 amino acids 
from position 702-978 revealed the most extensive homology 
with the tyrosine kinase domains of EGF-R and ERBB2. In 
this region 60 or 62% of amino acid residues were identical and 
90 or 89% were conserved, respectively. This stretch of amino 
acid homology coincides with the minimal catalytic domain of 
tyrosine kinases (1). There was significantly lower homology 
with other tyrosine kinases (Fig. 3). The consensus sequence 
for an ATP-binding site Gly-Xaa-Gly-Xaa-Xaa-Gly (1) at 
amino acid position 716-721 as well as a lysine residue located 
21 amino acid residues farther carboxyl-terminal were con- 
served between the three ERBB-related receptors. Taken 
together, these findings defined the region between amino acid 
position 702 and 978 as the putative catalytic domain of the 
ERBB3 protein (Fig. 3). 

The most divergent region of ERBB3 compared with either 
EGF-R or ERBB2 was its carboxyl terminus, comprising 364 
amino acids. Tyrosine residues at positions 1197, 1199, and 
1262 matched closest with the consensus sequence for puta- 
tive phosphorylation sites (28). The peptide one-letter se- 
quence YEYMN, encompassing Tyr-1197 and Tyr-1199, was 
repeated at positions 1260-1264 and was at both locations 
surrounded by charged residues. These observations render 
Tyr-1197, Tyr-1199, and Tyr-1262 likely candidates for au- 
tophosphorylation sites of the ERBB3 protein. 



Chromosomal Mapping of Human ERBB3. We determined 
the chromosomal location of the ERBB3 gene by in situ 
hybridization (29) with a 3H-labeled plasmid containing the 
ERBB3 amino-terminal coding sequence. A total of 110 
human chromosome spreads were examined prior and sub- 
sequent to G banding for identification of individual chro- 
mosomes. One hundred forty-two grains were localized on a 
400-band ideogram. We observed specific labeling of chro- 
mosome 12, where 38 out of 51 grains were localized to band 
q13 (data not shown). Thus, the genomic locus of ERBB3 was 
assigned to 12q13. In this region of chromosome 12, several 
genes have previously been mapped including the melanoma- 
associated antigen ME491 (30), histone genes (31), and the 
gene for lactalbumin (32). In addition, two protooncogenes, 
INT1 (33) and GLI (34), are located in close proximity to 
ERBB3. 

ERBB3 Expression in Normal and Malignant Human Cells. 
To investigate its pattern of expression, we surveyed a number 
of human tissues for the ERBB3 transcript. The 6.2-kb 
ERBB3-specific mRNA was observed in term placenta, post- 
natal skin, stomach, lung, kidney, and brain, but it was not 
detectable in skin fibroblasts, skeletal muscle, or lymphoid 
cells (data not shown). Among the fetal tissues analyzed, the 
ERBB3 transcript was expressed in liver, kidney, and brain 
but not in fetal heart or embryonic lung fibroblasts. These 
observations indicated the preferential expression of the 
ERBB3 transcript in epithelial tissues and brain. 

Fig. 3. (Upper) Predicted amino acid sequence of the ERBB3 
polypeptide (as deduced from the GenBank sequence M29366) 
and comparison with other receptor-like tyrosine kinases. The 
amino acid sequence is shown in single-letter code and is 
numbered at left. The putative extracellular domain (light 
shading) extends between the predicted signal sequence (solid 
box) at the amino terminus and a single hydrophobic 
transmembrane region (solid box) within the polypeptide. The 
putative ATP-binding site at the amino terminus of the TK 
domain is circled. Potential autophosphorylation sites within 
the carboxyl-terminal domain (COOH) are indicated by 
asterisks. Potential N-linked glycosylation sites are marked 
above the amino acid sequence. (Lower) The two cysteine 
clusters (Cys) in the extracellular domain and the predicted 
tyrosine kinase domain (TK) within the cytoplasmic portion of 
the polypeptide are outlined by dark shading. The percentage 
of amino acid homology of ERBB3 in individual domains with 
ERBB2 (4), EGF-R (2), MET (27), EPH (11), insulin receptor 
(IR) (9), and FMS (5) is listed below. Less than 16% identity 
is denoted by -. 

We also investigated ERBB3 expression in individual cell 
populations in comparison to EGF-R and ERBB2 transcripts. 
As shown in Table 1, mRNA levels of each were relatively 
high in keratinocytes and low but similar in cells derived from 
glandular epithelium. These findings are consistent with 
growth regulatory roles of all three receptor-like molecules in 
squamous and glandular epithelium. Whereas ERBB2 and 
EGF-R transcripts were also readily seen in normal fibro- 
blasts, the same cells lacked detectable ERBB3 mRNA. In 
contrast, normal human melanocytes, which expressed both 
ERBB3 and ERBB2 at levels comparable with human kera- 
tinocytes, lacked detectable EGF-R transcripts. Thus, the 
expression patterns of these receptor-like molecules were 
different in specialized cell populations derived from epider- 
mal tissues. 

The ERBB3 transcript was detected in 36 of 38 carcinomas 
and 2 of 12 sarcomas, whereas 7 tumor cell lines of hemato- 
poietic origin lacked measurable ERBB3 mRNA. Markedly 
elevated levels of a normal-sized transcript were observed in 
6 of 17 tumor cell lines derived from human mammary 
carcinomas. By Southern blot analysis, neither gross gene 
rearrangement nor amplification was detected in the cell lines 
(data not shown). Fig. 4A shows the results of Northern blot 
analysis with control AB589 nonmalignant human mammary 
epithelial cells (lane 1) and two representative human mam- 
mary tumor lines, MDA-MB415 (lane 2) and MDA-MB453 
(lane 3). Hybridization of the same filter with a human β-actin 
probe (Fig. 4B) verified levels of mRNA in each lane. 
Densitometric scanning indicated that the ERBB3 transcript 
in each tumor cell line was elevated more than 100-fold above 
that of the control cell line. Thus, overexpression of this third 
member of the ERBB family, as for the EGF-R and ERBB2 
genes, may play an important role in some human malignan- 
cies. 

DISCUSSION 

In the present report, we describe the identification of a third 
member of the ERBB/EGF receptor family of membrane- 
spanning tyrosine kinases and the cloning of its full-length 
coding sequence. This gene, designated ERBB3, encodes a 
predicted protein with striking structural similarities to other 
members of this family. These features include overall size, 
extracellular domain with two signature cysteine clusters, 
and its uninterrupted tyrosine kinase domain, exhibiting, at 
81% and 83%, significantly greater overall similarity to 
EGF-R and ERBB2 products than to any other tyrosine 
kinase. The structural relatedness of its extracellular domain 
with that of the EGF-R raises the possibility that one or more 
of an increasing number of EGF-like ligands (36) may interact 
with the ERBB3 product. 

Distinct regions within the predicted ERBB3 coding se- 
quence revealed relatively higher degrees of divergence. For 
example, its carboxyl-terminal domain failed to exhibit sig- 
nificant colinear identity scores with either ERBB2 or EGF-R. 
Within the tyrosine kinase domain, which represents the 
most conserved region of the predicted ERBB3 protein, a 
short stretch of 29 amino acids carboxyl-terminal to the 
ATP-binding site differed from regions of the predicted 
ERBB2 and EGF-R coding sequence in 28 and 25 positions, 
respectively. Such regions of higher divergence in their 
cytoplasmic domains may confer functional specificity to 
these closely related receptor-like molecules. 

Chromosomal mapping localized ERBB3 to human chro- 
mosome 12q11-13, whereas the related EGF-R and ERBB2 
genes are located on chromosomes 7p12-13 (37) and 17p12- 
21.3 (3, 29), respectively. Thus, each appears to localize to 
regions containing different respective homeobox (38, 39) 
and collagen gene (40) loci. Keratin type I and type II genes 
also map to regions of 12 and 17 (41, 42), consistent with 
localization of ERBB3 and ERBB2, respectively. 

Table 1. Normal expression pattern of human ERBB gene 
family members 

                                  Relative transcript levels 
Source                            ERBB3    ERBB2    EGF-R 
Embryonic fibroblast (M426)                +        + 
Skin fibroblast (501T)                     + 
Immortal keratinocyte (RHEK) 
Primary keratinocyte (NHEK)       +        +        ++ 
Glandular epithelium (AB589)                        (+) 
Melanocyte (NHEM)                 ++       ++ 

Replicate Northern blots were hybridized with equal probe counts 
of similar specific activity for ERBB3, ERBB2, and EGF receptor, 
respectively. Relative signal intensities were semiquantitatively es- 
timated: -, not detectable; (+), weakly positive; +, positive; ++, 
strongly positive. 

Recent studies in Drosophila have emphasized how critical 
and multifunctional are developmental processes mediated 
by ligand-receptor interactions. An increasing number of 
Drosophila mutants with often varying phenotypes have now 
been identified as being due to lesions in genes encoding such 
proteins (43, 44) including the Drosophila EGF-R homo- 
logue, designated DER. It is not yet known whether DER is 
the Drosophila counterpart of all three mammalian ERBB 
genes. If so, functions assigned to DER may eventually be 
associated with one or more of the divergent mammalian 
ERBB genes as well as other functions that have evolved in 
more complex mammalian organisms. 

There is evidence for autocrine (45, 46) as well as paracrine 
(19, 47) effectors of normal cell proliferation. However, the 
inherent transforming potential of autocrine growth factors 
(48, 49) suggests that growth factors most commonly act on 
their target cell populations by a paracrine route. Our survey 
of ERBB3 gene expression indicated its normal expression in 
cells of epithelial and neuroectodermal derivation. Compar- 
ative analysis of the three ERBB receptor-like genes in 
different cell types of epidermal tissue revealed that kerati- 
nocytes expressed all three genes. In contrast, melanocytes 
and stromal fibroblasts specifically lacked EGF-R and 
ERBB3 transcripts, respectively. Thus, melanocytes and 
stromal fibroblasts may be sources of paracrine growth 
factors for EGF-R and ERBB3 products, respectively, that 
are expressed by the other cell types residing in close 
proximity in epidermal tissues. 

Fig. 4. Elevated ERBB3 transcript levels in human mammary 
tumor cell lines. A Northern blot containing 10 µg of total cellular 
RNA from AB589 mammary epithelial cells (lane 1), as well as 
MDA-MB415 (lane 2) and MDA-MB453 (lane 3) mammary tumor cell 
lines was hybridized with an ERBB3 cDNA probe (A). After signal 
decay the same blot was rehybridized with a human β-actin cDNA 
probe (35) (B). 

To date, both ERBB/EGF-R and ERBB2 have been causally 
implicated in human malignancy. EGF-R gene amplification 
and/or overexpression in tumors has been demonstrated in 
squamous cell carcinomas and glioblastomas (50). ERBB2 
amplification and/or overexpression has been observed in 
human breast and ovarian carcinomas (51, 52), and ERBB2 
overexpression has been reported to be an important prog- 
nostic indicator of particularly aggressive tumors (52). Thus, 
our present finding that the ERBB3 transcript is overex- 
pressed in a significant fraction of human mammary tumor 
cell lines raises the possibility that this new member of the 
ERBB/EGF receptor family may also play an important role 
in some human malignancies. 

We thank Drs. D. Ron, G. Kruh, and P. Finch for providing some 
of the cellular RNA samples used in the initial ERBB3 expression 
analysis. 

The cytokine signaling pathways that activate the 
Janus family of tyrosine kinases (Jaks) and the "signal 
transducers and activators of transcription" (Stats) 
have been well characterized in mammalian systems. 
Work shown here provides evidence that an analogous 
signaling pathway exists in Drosophila melanogaster. 
Because many of the ligand-receptor pairs in Drosophila 
have not been fully characterized, it was necessary to 
bypass the receptor stimulation event that normally 
triggers intracellular Jak/Stat activation. This was done 
by treating Drosophila Schneider 2 cells with vanadate/ 
peroxide, which has been shown to closely mimic some 
signaling events triggered by interferon γ, including the 
activation of Jak1, Jak2, and the Stat1α protein. Evi- 
dence presented here demonstrates that vanadate/per- 
oxide can induce a γ response region binding complex in 
Drosophila Schneider 2 cells. This complex contains two 
phosphoproteins of 100 and 150 kDa, respectively, and 
shares many features with the vanadate/peroxide-stim- 
ulated binding complex in the mammalian system. 
Southern blot analysis of genomic DNA using the src 
homology domain 2 (SH2) of Stat1α confirms the pres- 
ence of a related gene in the Drosophila genome. 



The Jak/Stat pathway is activated in mammalian cell sys- 
tems by treatment of cells with a number of different cytokines 
and growth factors (1, 2). The Stat family of transcription 
activators has several common structural features, including 
conserved SH2 domains. The current model of Stat activation is 
that a tyrosine in the carboxyl terminus of the Stat protein is 
phosphorylated and acts as an SH2 binding site upon cytokine 
stimulation (3, 4). The Stat can then form dimers, either with 
itself or with another member of the Stat family. Dimerization of 
the Stats is necessary for DNA binding and, ultimately, induction 
of transcription. Activation of a transcription complex containing 
phosphorylated Stats can also be induced by treating cells with a 
combination of sodium orthovanadate and hydrogen peroxide in 
the absence of cytokines or growth factors (5). 

There is growing evidence that a Jak/Stat pathway may exist 
in Drosophila. First, hopscotch, a Drosophila Jak kinase homo- 
logue, has been cloned and characterized (6). hopscotch is a ma- 
ternal transcript, expressed in embryonic stripes. Mutations in 
this gene locus cause abnormalities in the expression patterns of 
some of the pair-rule and segment-polarity genes. This implicates 
the Jak kinase in signal transduction pathways 
controlling segmentation of the developing fly. 

Second, a number of growth factor receptor homologs have 
been cloned from Drosophila. torso encodes a receptor tyrosine 
kinase that has a cytoplasmic domain similar to that of the 
mammalian platelet-derived growth factor receptor and is in- 
volved in embryonic pattern formation (7, 8). der, the epidermal 
growth factor receptor homolog, has been cloned and genetically 
characterized (9). Mutations in the der locus cause pleiotropic 
effects, implicating it in the development of a wide range of 
tissues (10). Treatment of mammalian cells with either platelet- 
derived growth factor or epidermal growth factor has been shown 
to activate Jak/Stat pathways, suggesting that Torso or Der may 
activate a Jak/Stat pathway in Drosophila as well. 

Using the ligand-independent activation by vanadate/perox- 
ide treatment, we have attempted to identify a Stat-like activ- 
ity in Drosophila Schneider 2 cells. The specific GRR binding 
activity, which was seen after cell treatment, was dependent 
upon tyrosine phosphorylation. This binding complex con- 
tained phosphoproteins of 100 and 150 kDa. Detection of 
Stat1α-related sequences in the Drosophila genome also sug- 
gested that a Stat homolog exists in Drosophila. 

MATERIALS AND METHODS 

Cells and Reagents—Schneider 2 cells (ATCC CRL 1963) were main- 
tained in Schneider's Drosophila medium (Life Technologies, Inc.), 10% 
fetal bovine serum (Quality Biologies). Sodium orthovanadate and hy- 
drogen peroxide were purchased from Sigma. SDS-polyacrylamide gel 
electrophoresis was performed on the Novex system. 4G10 antibody was 
purchased from Upstate Biotechnology Inc. All other reagents were 
purchased from commercial sources unless otherwise noted. 

Schneider 2 Cell Treatment— A solution of 50 mM sodium orthovana- 
date, 500 mM hydrogen peroxide made in Schneider medium was incu- 
bated at 24 °C for 5 min. This solution was added to exponentially 
growing Schneider 2 cells to a final concentration of 100 µM sodium 
orthovanadate, 1 mM hydrogen peroxide, and cells were incubated for 
the indicated times at 24 °C. Cells were washed two times in phosphate- 
buffered saline. Whole cell lysates were prepared by solubilizing cells 
for 10 min, on ice, with intermittent vortexing, in a buffer containing 20 
mM HEPES, pH 7.0, 10 mM KCl, 1 mM MgCl2, 20% glycerol, 0.1% 
Nonidet P-40, and 1% Triton X-100. Particulate matter was separated 
from soluble material by centrifugation at 18,000 x g for 5 min. Where 
indicated, cells were pretreated with 500 nM staurosporin or 30 µg/ml 
genistein for 30 min. Vanadate/peroxide solution (described above) was 
added to cells, and the incubation continued for an additional 60 min. 
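
The stock and final concentrations given above imply the same dilution for both reagents, so a single addition of the premix sets both. The following Python sketch illustrates the arithmetic using C1 x V1 = C2 x V2. The function names and the 10 ml culture volume are illustrative assumptions, not details from the original methods.

```python
import math


def dilution_factor(stock_mM, final_mM):
    """Fold dilution needed to take a stock to its final concentration."""
    return stock_mM / final_mM


def stock_volume_ul(final_volume_ml, stock_mM, final_mM):
    """Volume of stock (in microliters) for a given final volume,
    from C1 * V1 = C2 * V2 with V2 taken as the final volume."""
    return final_volume_ml * 1000.0 * final_mM / stock_mM


# Concentrations quoted in the text.
vanadate = dilution_factor(50.0, 0.1)   # 50 mM stock to 100 µM final
peroxide = dilution_factor(500.0, 1.0)  # 500 mM stock to 1 mM final

# Hypothetical 10 ml culture, chosen only to show the calculation.
volume_for_10_ml = stock_volume_ul(10.0, 50.0, 0.1)

print(vanadate, peroxide, volume_for_10_ml)
```

Both reagents come out at a 500-fold dilution, which is consistent with the two being premixed in one stock solution before addition to the cells.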

Electrophoretic Mobility Shift Assays— 10 µg of soluble protein was 
diluted in binding buffer to a final concentration of 10 mM Tris, pH 7.4, 
5 mM MgCl2, 100 mM KCl, 1 mM dithiothreitol, 50% glycerol, 0.03% 
Nonidet P-40, and 0.1 mg/ml poly(dI-dC). Double-stranded probe was 
labeled with [γ-32P]ATP using polynucleotide kinase. 1 ng of probe was 
added to the extract. A 10- or 50-fold excess of unlabeled competitor 
oligonucleotide was added as indicated. For sequences of oligonucleo- 
tides used, see Fig. 2D. Binding reactions proceeded for 5 min at room 
temperature. Where indicated, binding reactions were preincubated for 
30 min with 5 mM phenyl phosphate, 5 mM sodium phosphate, or 
Yersinia protein tyrosine phosphatase in the presence or absence of 1 
mM sodium orthovanadate or 3 mM sodium tungstate. Binding was 
analyzed on a 6% non-denaturing gel containing 2.5% glycerol in 0.25 x 
TBE (22.5 mM Tris, 22.2 mM boric acid, 0.5 mM EDTA). Gels were dried 
and subjected to autoradiography. 

Fig. 1. Vanadate/peroxide-inducible GRR binding complex in 
Drosophila Schneider 2 cells. Whole cell lysates were made from 
untreated Schneider 2 cells (lane 1), cells treated with vanadate/perox- 
ide for increasing amounts of time (lanes 2-8), or cells treated for 1 h 
with either hydrogen peroxide alone (lane 9) or vanadate alone (lane 
10). Whole cell lysates from these cells were analyzed by electrophoretic 
mobility shift assay (as described under "Materials and Methods"). 

Protein Purification and SDS-PAGE Analysis — Soluble cell lysate 
from activated or control Schneider 2 cells was allowed to bind to 
heparin-agarose (Sigma) for 1 h at 4 °C. The agarose was washed with 
cell lysis buffer, and the vanadate/peroxide-induced binding complex 
(VBC) was eluted with a salt gradient from 0 to 300 mM NaCl in cell 
lysis buffer. Biotinylated GRR oligonucleotide was bound to streptavi- 
din-agarose according to standard protocols. Partially purified VBC was 
allowed to incubate with GRR-agarose in the presence of 200 µg/ml 
salmon sperm DNA (Digene Diagnostics, Inc.) for 1 h at 4 °C. Affinity 
beads were washed 3 times with cell lysis buffer. Protein was solubi- 
lized using SDS-PAGE sample buffer. After denaturation, proteins 
were separated on a 4-20% acrylamide gel and transferred to polyvi- 
nylidene difluoride membranes (Immobilon, Millipore) by electro- 
blotting. Blots were subjected to standard Western blotting procedures 
using biotinylated 4G10 anti-phosphotyrosine, streptavidin-conjugated 
horseradish peroxidase, and ECL (Amersham Corp.) for detection. 

Southern Analysis — 10 µg of normal human genomic DNA, 10 µg of 
Drosophila genomic DNA, or 2.5 µg of Drosophila genomic DNA was 
digested using EcoRI. Southern blot analysis, at a stringency reduced 
by 14 °C, was performed as described (11). The probe was generated by 
the polymerase chain reaction using oligonucleotide primers 
(5'-TACTGTGTTCATCATACTGTC-3' and 
5'-TGGAATGATGGATGCATCATGGGCTT-3'), spanning nucleotides 1913-2446 
of human Stat1α, and labeled by nick translation. 

RESULTS AND DISCUSSION 

Vanadate/Peroxide Treatment of Drosophila Schneider 2 
Cells Induces Binding of a Specific Complex to the GRR of the 
Fc Receptor Promoter — We have previously shown that vana- 
date/peroxide will mimic the action of interferon γ on mono- 
cytes (5). We have also detected vanadate/peroxide stimulation 
of GRR-binding proteins in a number of different cell types, 
including HeLa, U266, THP1, U937, Daudi, and fibroblast 
lines. 2 Because Drosophila interferons have yet to be identified, 
we activated potential Stat-like proteins with vanadate/perox- 
ide, thereby bypassing a ligand/receptor interaction. Treat- 
ment of Schneider 2 cells with vanadate/peroxide resulted in 
the formation of a complex that specifically bound to the GRR 
in a time-dependent manner (Fig. 1). Cells required 20 min of 
exposure to vanadate/peroxide for assembly of the complex as 
measured by electrophoretic mobility shift assays. DNA bind- 
ing activity increased for up to 120 min of incubation (Fig. 1, 
lanes 1-8). Longer incubation times did not increase the 
intensity of the shift (data not shown). The intensity of the 
Drosophila shift was similar to the intensity of shifts de- 
tected in mammalian cells; however, the Drosophila shift 
complex migrated slightly faster on the native gel than did 
the mammalian shift complex (data not shown). The binding 
activity was induced in the presence of cycloheximide, sug- 
gesting that new protein synthesis was not necessary for 
activity (data not shown). These findings suggest that the 
appearance of DNA binding activity is an early step in the 
activation of this pathway. 

2 A. C. Larner and D. S. Finbloom, unpublished data. 

Fig. 2. Vanadate/peroxide-inducible DNA binding complex 
binds specifically to GRR and GAS elements. A, vanadate/perox- 
ide-induced GRR binding complexes formed in the presence of a 10-fold 
(lane 2) or 50-fold (lane 3) excess of unlabeled GRR and a 10-fold (lane 
4) or 50-fold (lane 5) excess of unlabeled GAS. B, vanadate/peroxide- 
induced GRR binding complexes formed in the presence of a 50-fold 
excess of unlabeled SIE (lane 2), a 50-fold excess of unlabeled ISRE 
(lane 3), or a 50-fold excess of unlabeled AP-1 (lane 4). C, vanadate/ 
peroxide-induced GAS binding complexes formed in the presence of a 
10-fold (lane 2) or 50-fold (lane 3) excess of unlabeled GAS and a 10-fold 
(lane 4) or 50-fold (lane 5) excess of unlabeled GRR. D, comparison of the 
enhancer sequences used for electrophoretic mobility shift assays and 
competitions. A consensus sequence was determined by comparing se- 
quences that competed for the GRR binding complex. 

To determine whether the induced complex bound specifi- 
cally to the GRR, we performed competition experiments using 
specific unlabeled oligonucleotides. A 10-fold excess of either 
unlabeled GRR or a closely related γ-activated sequence (GAS) 
from the IRF1 promoter competed for labeled GRR binding 
(Fig. 2A, lanes 2 and 4), while only a slight decrease in binding 
was observed when a 50-fold excess of a high affinity SIE (13) 
was added (Fig. 2B, lane 2). Addition of either unlabeled ISRE 
or AP-1 did not affect the GRR binding complex (Fig. 2B, lanes 
3 and 4). A DNA binding complex was also detected when a 
labeled IRFl/GAS element was used in the electrophoretic 
mobility shift assay. Addition of either unlabeled GRR or GAS 
to this binding reaction competed for the labeled GAS binding 
complex (Fig. 2C, lanes 2-4). These data suggest that the VBC 
has approximately the same affinity for the GAS and GRR sites 
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Fit;. 3. Tyrosine phosphorylation is necessary for mainte- 
nance of the shift complex. A, whole cell lysates from vanadate/ 
peroxide-stimulated Schneider 2 cells were untreated (lane /), treated 
with 5 mM phenyl phosphate (lane 2), treated with 5 mM sodium phos- 
phate (lone .?), or treated with 50 mM sodium chloride (lane 4). B, whole 
cell lysates from vanadate/peroxide-stimulated Schneider 2 cells were 
untreated (lane 1), treated with YOP51 {lane 2), or treated with YOP51 
in the presence of 1 mM sodium orthovanadate (lone 3) or 3 mM sodium 
tungstate (lane 4). C, Schneider 2 cells were pretreated with no addition 
(lane /). 500 nM staurosporin for 30 m in (lane 2), or 30 jxg/ml genistein 
for 30 min (lane 3). Cells were then treated with vanadate/peroxide for 
60 min. Whole cell lysates from all experiments were analyzed by 
electrophoretic mobility shift assay. 

as judged by competition analysis but has a lesser affinity for 
the SIE sequence. Electrophoretic mobility shift assays sup- 
ported these findings where the GAS and GRR elements 
showed strong binding and the SIE showed slight but detecta- 
ble binding (data not shown). 

The DNA elements that were bound by the VBC share a 
consensus ATTTCCCNGAAA core region that contains the 
GAS element described in the mammalian system (Fig. 2D) (2). 
The ISRE and AP-1 elements, which did not compete, lack this 
consensus sequence. 
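
To make the role of the degenerate position concrete, the sketch below scans a sequence for the ATTTCCCNGAAA core, treating N as any base. It is an illustration only: the function names are hypothetical, the test sequences are invented for the example rather than taken from Fig. 2D, and only the strand as written is scanned.

```python
import re

CONSENSUS = "ATTTCCCNGAAA"  # core region quoted in the text; N = any base


def consensus_to_regex(consensus):
    """Translate a consensus string into a compiled regular expression,
    expanding the IUPAC symbol N to any of the four bases."""
    pattern = "".join("[ACGT]" if base == "N" else base
                      for base in consensus.upper())
    return re.compile(pattern)


def contains_core(sequence, consensus=CONSENSUS):
    """Return True if the sequence contains the consensus core."""
    return consensus_to_regex(consensus).search(sequence.upper()) is not None


# Hypothetical sequences, for illustration only.
with_core = "ggATTTCCCAGAAAtt"   # matches, with A at the N position
without_core = "ggATTTCCGAGAAAtt"  # one mismatch at a fixed position

print(contains_core(with_core), contains_core(without_core))
```

A check of this kind would also report no match for an element that lacks the core, which is the property the text attributes to the ISRE and AP-1 competitors.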

Tyrosine Phosphorylation Is Necessary for the Formation of 
the Binding Complex — It has been clearly established that the 
Stat-like proteins are phosphorylated on tyrosine residues in 
response to cytokine stimulation of cells and that this phospho- 
rylation is necessary for an active complex to be assembled (1, 
14-20). A functional SH2 domain of Stat1α and phosphoryla- 
tion of tyrosine 701 are necessary for dimer formation in re- 
sponse to interferon γ (4). Phosphopeptides corresponding to 
sequences surrounding tyrosine 701 can block dimer formation. 
It has been speculated that this disruption is due to the SH2 
domain of Stat lev binding to the phosphopeptide instead of 
binding to phosphotyrosine 701. Phenyl phosphate can also 
disrupt Stat complexes, presumably by competing for SH2 do- 
main binding (20). Phenyl phosphate added to vanadate/perox- 
ide-stimulated extracts of Schneider 2 cells inhibited the DNA 
binding activity of the VBC (Fig. 3A, lane 2), while equimolar 
amounts of sodium phosphate or a 10-fold higher concentration 
of sodium chloride had no effect on the DNA binding activity 
(Fig. 3A, lanes 3 and 4). The inhibition of DNA binding activity 
by phenyl phosphate can be reversed by removing the salt by 
dialysis (data not shown). These results are consistent with the 
Drosophila vanadate/peroxide-inducible complex containing a 
protein with an SH2 domain. 

Other assays have been used in the mammalian system to
confirm that Stat-like proteins are tyrosine-phosphorylated
and that this phosphorylation is important for maintaining a
DNA binding activity (1, 20, 21). These assays have included
the treatment of activated extracts with the tyrosine-specific
phosphatase, YOP51, and pretreatment of cells with the tyrosine
kinase inhibitor, staurosporin (21). Treatment of a Schneider 2
cell lysate from vanadate/peroxide-treated cells with the
tyrosine-specific phosphatase, YOP51, disrupted the DNA
binding activity (Fig. 3B, lane 2). This inhibition was reversed
by the addition of either 1 mM sodium vanadate or 3 mM sodium
tungstate, which are inhibitors of tyrosine-specific phosphatases.
The VBC was also inhibited by pretreatment of the
Schneider 2 cells with the tyrosine kinase inhibitors, genistein
or staurosporin (Fig. 3C, lanes 2 and 3). These data suggest
that a tyrosine kinase is involved in the formation of the
Drosophila DNA binding complex and that tyrosine phosphorylation
is necessary for DNA binding activity.

Fig. 4. Vanadate/peroxide-stimulated tyrosine phosphorylation
of two GRR-binding proteins. Partially purified vanadate/peroxide-stimulated
binding complex was bound to a GRR affinity column
in the presence of salmon sperm DNA to compete for nonspecific
DNA-binding proteins. Proteins adsorbed to the column were eluted in SDS
sample buffer and separated by SDS-PAGE. The proteins were transferred
to Immobilon and probed with 4G10. The two dominant proteins
that were inducibly phosphorylated are indicated.

Vanadate/Peroxide Induces the Tyrosine Phosphorylation of
Two Major GRR-binding Proteins, pp100 and pp150 — VBC
was partially purified as described under "Materials and Meth- 
ods" and adsorbed to a GRR oligonucleotide affinity resin. The 
loss of the VBC activity from the extract was monitored by 
electrophoretic mobility shift analysis. The DNA binding com- 
plex, which specifically adsorbed to the GRR column, was an- 
alyzed by Western blotting with 4G10, a monoclonal antibody 
that recognizes phosphotyrosine. Two inducibly phosphoryl- 
ated proteins of approximate molecular mass of 100 and 150 
kDa were detected (Fig. 4). Either of these sizes would be 
appropriate for a Drosophila Stat since the mammalian Stat 
proteins range in molecular mass from 84 kDa (Stat 4) to 113 
kDa (Stat 2) (22, 23). Two minor bands of 120 and 130 kDa were 
not consistently detected and could be due to degradation of 
pp150.

Southern Analysis Demonstrates That a Drosophila Gene 
Exists Which Hybridizes to the Human Stat1α SH2 Domain
Sequence — Based upon functional evidence presented here, we 
sought to determine whether the Drosophila genome harbors 
DNA sequences related to human Stat1α. Ten µg of normal
human and Drosophila genomic DNA were digested with
EcoRI, and the products were separated on agarose gels. To
correct for differences in genomic complexity of both species,
2.5 µg of Drosophila genomic DNA were also analyzed. Southern
blot analysis was performed using the SH2 domain coding
sequence of human Stat1α as a probe. As shown in Fig. 5, three
major bands of 3, 4.4, and 10 kilobases, respectively, were
observed in Drosophila DNA in addition to several minor
bands. The strongly hybridizing fragments were also detectable
in 2.5 µg of Drosophila genomic DNA, ensuring specificity of
hybridization. These findings suggest that the Drosophila
genome contains at least one and perhaps three genes related to
the mammalian Stat1α SH2 domain.

In summary, we have identified a unique DNA binding activity
in Drosophila melanogaster. This activity resembles the
Stat-like activity that has been extensively characterized in the
mammalian system. In both the mammalian and the Drosophila
systems, vanadate/hydrogen peroxide treatment of cultured
cells induces a specific GRR binding complex whose formation
is dependent upon tyrosine phosphorylation. Detection of
Stat1α-related sequences in the Drosophila genome raises the
possibility that, as in the mammalian system, Stat1α-like proteins
are responsible for this activity. Work is in progress to
obtain the protein sequence and to clone the cDNA encoding
such novel Stat proteins.

Fig. 5. Stat1-related sequences identified in Drosophila
genomic DNA by reduced stringency hybridization. 10 µg of
human genomic DNA (lane 1), 10 µg of Drosophila genomic DNA (lane
2), or 2.5 µg of Drosophila genomic DNA (lane 3) was digested with
EcoRI and separated on a 0.8% agarose gel. The SH2 domain of human
Stat1α was labeled by nick translation and used as a probe. The
mobility of molecular weight standards (λ HindIII) is indicated on the left.

Work presented here is important to the understanding of 



the mammalian Jak/Stat signaling pathways. Drosophila can 
provide a genetic model of Jak/Stat activation and give insight 
into the significance of conserved protein sequences. This work 
is also important for Drosophila development because, like the 
Drosophila Jak protein, the Stat-like protein may play a role in 
signal transduction during stripe formation.

546 CLONING OF GENES ENCODING PROTEIN KINASES [46]

Plus/Minus Screening

The screening of PCR libraries generated with the oligonucleotides
PTK1 and PTK2 has been based on a random accumulation of sequence
data, followed by subsequent selection of sequences of interest. While in
the early phase of this work a high proportion of clones contained new
PTK-related sequences, later experiments proved to be somewhat less
fruitful. Moreover, the selection of clones for sequencing was a purely
random process and was uninformative with regard to other potentially
important selection criteria, such as expression pattern. An improvement
in this respect has been the opportunity to screen PCR libraries for
PTK-related sequences which show a tissue-specific or developmentally
regulated pattern of expression. The PCR library generated from the mRNA
source of interest is screened, in duplicate, with ³²P-labeled PCR product
from which the library was constructed. The filters are then stripped and
reprobed with a PCR probe generated in an identical fashion from an
mRNA source other than that used to construct the PCR library. In this
way we have been able to generate differentially expressed PTK-related
clones expressed in epithelial cells but not fibroblasts (A. Ziemiecki and
A. F. W., unpublished data). Although this procedure has been employed
successfully we have not tested the limits of its sensitivity, nor its utility
in other situations, such as, for example, a differentiation system.

By M. H. Kraus and S. A. Aaronson

Introduction

Protein-tyrosine kinases (PTKs) are encoded by a conserved family of
ancestrally related genes that have segregated during evolution within the
larger family of protein kinases.¹ They share a catalytic domain encoding
250-300 amino acid residues that represents the region of most extensive
structural conservation and harbors the intrinsic enzymatic activity.
Individual members in distinct subfamilies of both receptor-like and
cytoplasmic tyrosine kinases share closer structural and functional homology
with each other than with other protein kinases. The higher degree of
structural homology among functionally more closely related PTKs provides
a basis for the identification of novel members with predictable
functional properties due to their association with individual PTK
subfamilies. Based upon concurrent nucleotide sequence conservation, the search
for novel PTK genes can be directed toward a specific subfamily without
detection of more distantly related PTK genes, due to cross-hybridization
of genomic exon sequences of novel members under quantitatively
reduced hybridization stringencies with PTK domain probes of known
subfamily members. Analyzing normal genomic DNA representing PTK genes
at single-copy level permitted detection of exon sequences encoding
closely related novel PTKs independent of their expression pattern.

¹ S. K. Hanks, A. M. Quinn, and T. Hunter, Science 241, 42 (1988).

REDUCED STRINGENCY HYBRIDIZATION 547

Novel genomic restriction fragments are cloned and the most conserved
region is mapped by Southern blot analysis. Thus, exons with
highest homology to the probe are identified and their structure is
determined by nucleotide sequence analysis, establishing the degree of
structural relatedness and exon organization of a novel related tyrosine kinase
gene. Furthermore, exon-containing probes can be derived for expression
analysis as well as the isolation of cDNA clones of novel PTKs. Employing
probes from the TK domain of v-erbB, v-fms, and v-abl we have isolated
with erbB-2² and erbB-3³ two novel receptor-like tyrosine kinases in the
erbB/EGF-R family, a second PDGF-R⁴ in the fms/CSF1-R family, as well
as a gene closely related to abl,⁵ respectively in the human genome.



Genomic Southern Blot Analysis 

The reliable detection of novel PTK genes at reduced stringency
requires a high sensitivity and specificity of genomic Southern blot analysis.
Using a slightly modified procedure we detect human single-copy genomic
DNA fragments with a completely matching probe at high stringency after
2-4 hr of film exposure at -70° in the presence of intensifier screens.

1. Ten micrograms genomic DNA is restricted in a final volume of
200-400 µl. To monitor restriction, bacteriophage λ DNA is incubated
with a small aliquot of the main reaction at identical DNA/enzyme unit
ratio.

2. After complete restriction, DNA samples are purified with an equal

² C. R. King, M. H. Kraus, and S. A. Aaronson, Science 229, 974 (1985).
³ M. H. Kraus, W. Issing, T. Miki, N. C. Popescu, and S. A. Aaronson, Proc. Natl. Acad. [...]
⁴ T. Matsui, M. Heidaran, T. Miki, N. Popescu, W. LaRochelle, M. Kraus, J. Pierce, and
S. Aaronson, Science 243, 800 (1989).
⁵ G. D. Kruh, C. R. King, M. H. Kraus, N. C. Popescu, S. C. Amsbaugh, W. O. McBride,
and S. A. Aaronson, Science 234, 1545 (1986).
⁶ E. M. Southern, J. Mol. Biol. 98, 503 (1975).



volume of buffered phenol/CIA (chloroform/isoamyl alcohol, 24:1) and
precipitated from 1 M ammonium acetate with 2.5 vol cold ethanol on dry
ice for 20 min. The pelleted DNA is carefully washed with cold 75% (v/v)
ethanol, lyophilized, and following solubilization in 30 µl 1x sample buffer
loaded on a 0.8-1.2% (w/v) horizontal agarose gel (20 x 25 cm; i.e., BRL,
Gaithersburg, MD). Electrophoresis proceeds overnight at 35 V in a
Tris-acetate gel/buffer system (1x E) containing 0.5 µg/ml ethidium bromide
with buffer recirculation.

40x E buffer: 1.6 M Tris, 0.8 M sodium acetate, 40 mM EDTA; adjust
pH to 7.2 with acetic acid

10x sample buffer: 0.2 M Tris-acetate, pH 7.5, 0.02 M EDTA, 1% (w/v)
SDS, 50% (v/v) glycerol, 0.3% (w/v) Bromphenol Blue
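
The working concentrations implied by these stock recipes follow from simple dilution arithmetic. The sketch below is illustrative only; the function names and the 2-liter running-buffer volume in the usage note are assumptions, not part of the published procedure.

```python
from typing import Dict

# Stock composition of 40x E buffer as given in the text (mol/l).
E_BUFFER_40X: Dict[str, float] = {
    "Tris": 1.6,
    "sodium acetate": 0.8,
    "EDTA": 0.040,
}


def dilute(stock: Dict[str, float], stock_fold: float,
           final_fold: float = 1.0) -> Dict[str, float]:
    """Component concentrations after diluting a stock to final_fold."""
    factor = final_fold / stock_fold
    return {name: value * factor for name, value in stock.items()}


def stock_volume_ml(final_volume_ml: float, stock_fold: float,
                    final_fold: float = 1.0) -> float:
    """Volume of concentrated stock needed for the requested final volume."""
    return final_volume_ml * final_fold / stock_fold


one_x_e = dilute(E_BUFFER_40X, stock_fold=40)
```

Under these assumptions, 1x E corresponds to about 40 mM Tris, 20 mM sodium acetate, and 1 mM EDTA, and `stock_volume_ml(2000, 40)` gives the 50 ml of 40x stock needed for a hypothetical 2 liters of running buffer.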

3. Following gel photography, the gel is irradiated for 2 min with
302-nm UV light to improve transfer of larger DNA fragments, and trimmed
to 20 x 20 cm. The DNA is denatured by two consecutive gel treatments
of 15 min in 0.5 M NaOH/1.5 M NaCl and equilibrated with two treatments
of 15 min each in 1 M ammonium acetate/0.02 M NaOH. Capillary transfer
in this solution to standard 0.45-µm nitrocellulose proceeds overnight with
the bottom of the gel facing the membrane. After transfer, the filter is
baked for 2 hr at 80° under vacuum.

4. Standard high-stringency hybridization is conducted in 5x SSC (1x
SSC = 0.15 M NaCl, 0.015 M sodium citrate, pH 7) and 50% formamide at
42°, which establishes conditions 20-25° below the T_m (melting point) of a
completely matched DNA hybrid. For hybridization the membrane is
placed in a sealing bag, wetted in two-thirds of the final volume (0.075 ml/
cm² filter area), and probe is added in the remaining third of hybridization
solution. The bag is sealed and after thorough mixing of the solutions
placed between two glass plates in a 42° water bath for 8-16 hr.

Hybridization solution (1x): 5x SSC, 10% (w/v) dextran sulfate, 2.5x
Denhardt's solution, 10 mM Tris pH 7.4, 50 µg/ml sheared and boiled
salmon sperm DNA, 50% (v/v) formamide, 2-5 x 10⁶ cpm/ml at a
DNA concentration of <5 ng/ml
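
The volumes and probe amounts implied by step 4 and the recipe above can be worked out for a given filter size. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the cpm/ml and ng/ml figures are taken to refer to the final combined volume, which the text does not state explicitly.

```python
from typing import Dict


def hybridization_plan(width_cm: float, height_cm: float,
                       ml_per_cm2: float = 0.075,
                       cpm_per_ml: float = 2e6,
                       max_probe_ng_per_ml: float = 5.0) -> Dict[str, float]:
    """Volumes and probe amounts for one filter, per the figures in the text."""
    area_cm2 = width_cm * height_cm
    total_ml = area_cm2 * ml_per_cm2
    return {
        "area_cm2": area_cm2,
        "total_ml": total_ml,
        # Two-thirds of the volume wets the membrane without probe.
        "prewet_ml": total_ml * 2.0 / 3.0,
        # The remaining third carries the labeled probe.
        "probe_solution_ml": total_ml / 3.0,
        # Assumption: cpm/ml applies to the final combined volume.
        "probe_cpm": total_ml * cpm_per_ml,
        # Upper bound on probe DNA from the <5 ng/ml limit.
        "max_probe_ng": total_ml * max_probe_ng_per_ml,
    }


plan = hybridization_plan(20, 20)  # gel trimmed to 20 x 20 cm in step 3
```

For the 20 x 20 cm filter this corresponds to about 30 ml in total, 20 ml for wetting and 10 ml containing the probe, with roughly 6 x 10⁷ cpm at the lower end of the stated range.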

Hybridization buffer containing SSC, dextran sulfate, Denhardt's, and
Tris is prepared as 2x stock solution by dissolving dextran sulfate powder
in SSC while stirring prior to adding the other components. Nonradioactive
and radioactive hybridization solutions are prepared by adding salmon
sperm DNA to the formamide, and purified probe is added to the latter
solution. Both are then incubated 5 min at 50°, thoroughly mixed, and
hybridization buffer is added to 1x final concentration. This step will
ensure sufficient denaturation of both carrier and labeled probe DNA.
For reduced stringency hybridization, the volume is adjusted with water
following denaturation in formamide. High specific activity probes of >10⁹
cpm/µg DNA can be routinely obtained with commercially available nick
translation (i.e., Amersham, Arlington Heights, IL) or random primer kits.
Under these conditions "prehybridization" of the filter is not required, as
long as the filter is prewetted in nonradioactive hybridization solution prior
to adding the probe.

5. After hybridization, filters are washed three times for 5 min each in
2x SSC, 0.01% SDS at room temperature. Two high-stringency washes
of 20 min each are then conducted at 50° in 0.1x SSC/0.1% SDS and 0.1x
SSC, respectively.

Detection of Novel Tyrosine Kinase Gene Fragments by Genomic
Hybridization Analysis at Reduced Stringencies

Cross-hybridization of related tyrosine kinase gene fragments is
accomplished by reduced stringency hybridization facilitating hybrid formation
and stability between partially mismatched sequences. While theoretical
aspects of the formation of DNA hybrids and sequence homology have
been extensively discussed in the literature,⁷ most of these findings derive
from liquid hybridization and may vary to some extent in filter
hybridization where the template DNA is immobilized.

Thus, to determine experimentally the optimal stringency for
cross-hybridization of a novel related tyrosine kinase gene, it proves useful to
subject replica nitrocellulose filters containing genomic DNA to
hybridization under stringencies incrementally decreased by 7°. Discrete
intermediate levels of reduced stringency (Table I) may be useful in some cases.
Since stringency is presumed to affect predominantly the formation of
hybrids during hybridization and the stability of hybrids during washing,
both hybridization and washing are conducted under equally reduced
stringencies. Experimentally these are controlled by the formamide
concentration during hybridization and salt concentration during washing,
respectively, with otherwise identical conditions (Table I).

It is helpful to relate stringency in temperature degrees (°C) assuming
that decrease of formamide concentration by 1% increases the T_m of the
hybrid by 0.7° and a logfold increase in ionic strength raises the stability
of a hybrid by 18.5°.⁷,⁸ Thus, considering incubation temperature (T_i),

⁷ M. L. M. Anderson and B. D. Young, in "Nucleic Acid Hybridisation" (B. D. Hames and
S. J. Higgins, eds.), p. 73. IRL Press, Oxford, 1985.
⁸ G. A. Beltz, K. A. Jacobs, T. H. Eickbush, P. T. Cherbas, and F. C. Kafatos, in "Methods
in Enzymology" (R. Wu, L. Grossman, and K. Moldave, eds.), p. 266. Academic Press,
New York, 1983.



TABLE I
Stringency Reduction (a)

Stringency reduction    Hybridization    Washing
(ΔT_s) (°C)             (42°/5x SSC)     (50°)
----------------------------------------------------------------
   0                    50% FA           0.1x SSC  -> 0.02 M Na+
  -3.5                  45% FA           0.15x SSC -> 0.03 M Na+
  -7                    40% FA           0.25x SSC -> 0.05 M Na+
 -10.5                  35% FA           0.4x SSC  -> 0.08 M Na+
 -14                    30% FA           0.6x SSC  -> 0.12 M Na+
 -17.5                  25% FA           1x SSC    -> 0.2 M Na+
 -21                    20% FA           1.5x SSC  -> 0.3 M Na+

(a) ΔT_s indicates reduction of stringency relative to high-stringency conditions
(bold type in the original; first row). FA, Formamide; 1x SSC = 0.15 M NaCl,
0.015 M sodium citrate, pH 7. For each stringency, hybridization is conducted
at 42° in 5x SSC, while washing occurs at 50°.



formamide concentration (FA), and ionic strength (µ), experimental
stringency conditions relative to high-stringency hybridization can be
approximated as

$$T_s = T_i + 0.7\,(\%\,\mathrm{FA}) - 18.5\,\log(\mu/\mu_0)$$

where T_s is the stringency expressed as temperature degrees, T_i the
incubation temperature, and µ the ionic strength, while µ_0 represents the ionic
strength during hybridization conditions equalling 1 M Na+ (5x SSC). At
these high salt concentrations the hybrid stability is maximal and relatively
unaffected by variations of ionic strength (log 1 = 0). Under
high-stringency conditions the estimated T_s of the hybridization (T_s = 42° +
0.7° x 50 - 18.5° x log 1) is 77° and the T_s during washing (T_s = 50° +
0.7° x 0 - 18.5° x log 0.02) is 81°. Reducing the formamide concentration
by 10% results in an estimated stringency of T_s = 70°, yielding a stringency
reduction of ΔT_s = -7° in the hybridization. Equivalent reduction of
the washing stringency requires an increase of the salt concentration to
0.25x SSC (0.05 M Na+). Table I lists experimental conditions with
the respective estimated reduction in stringency achieved simultaneously
during hybridization and washing. Since the rate of filter hybridization is
not affected by formamide concentrations in the range from 50 to 30% and
only slightly reduced at 20% formamide,⁷ hybrid formation at different
stringencies mainly depends on the degree of mismatches between probe
and template DNA, eliminating hybridization kinetics as a major variable.

Under these experimental conditions, we observe comparable signal/noise 
ratios at both high and reduced stringency. 
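
The stringency estimate described above lends itself to a short calculation. The following sketch simply evaluates the stated relation for the formamide and Na+ values given in the text and Table I; the function and variable names are illustrative assumptions, and the decadic logarithm is assumed because it reproduces the 77° and 81° values quoted above.

```python
import math
from typing import List, Tuple


def stringency(temp_c: float, formamide_pct: float, na_molar: float,
               na_ref_molar: float = 1.0) -> float:
    """Estimated stringency T_s (degrees C) from the relation in the text."""
    return (temp_c + 0.7 * formamide_pct
            - 18.5 * math.log10(na_molar / na_ref_molar))


# (formamide % during hybridization, Na+ molarity during washing).
# The 0.05 M entry follows the value stated in the running text.
TABLE_I: List[Tuple[float, float]] = [
    (50, 0.02), (45, 0.03), (40, 0.05), (35, 0.08),
    (30, 0.12), (25, 0.2), (20, 0.3),
]


def hybridization_ts(formamide_pct: float) -> float:
    # 42 degrees in 5x SSC; 5x SSC is taken as the 1 M Na+ reference.
    return stringency(42.0, formamide_pct, 1.0)


def washing_ts(na_molar: float) -> float:
    # 50 degrees, no formamide present during washing.
    return stringency(50.0, 0.0, na_molar)


def stringency_reductions() -> List[Tuple[float, float, float, float]]:
    """Rows of (FA %, Na+ M, delta T_s hybridization, delta T_s washing)."""
    ref_hyb = hybridization_ts(50)
    ref_wash = washing_ts(0.02)
    return [(fa, na, hybridization_ts(fa) - ref_hyb, washing_ts(na) - ref_wash)
            for fa, na in TABLE_I]
```

Evaluated this way, the hybridization reductions fall in exact 3.5° steps, while the washing reductions computed from the rounded Na+ values agree with them to within about 1°.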

Fig. 1. Structural relationship of exons encoding novel receptor-like and cytoplasmic
[...] ATP-binding site (GxGxxG), interkinase region (IK) and [...]
the autophosphorylation [...] Tyr-416 [...]
stringency required for detection of related tyrosine kinases, percentage [...]
of individual exons (solid boxes) with the homologous regions of the utilized probes
(open bars) are indicated. [Figure not reproduced. Legible labels: tyrosine kinase
domain, GxGxxG, IK, TK1, TK2; scale bar, 100 bp; probes v-erbB, v-fms, and v-abl;
genes erbB-3, PDGFR α, and arg; ΔT_s values -14, -14, and -11.]

The extent of stringency reduction required to detect a related sequence
[...] correlated to some degree with the nucleotide sequence
[...] score. Hybridization efficiency and hybrid stability of related
[...] also depend on the distribution of
[...] and contiguous [...] are
[...] homology between related
tyrosine kinases tends to coincide with conserved exon structure.
For instance, three exons that have been characterized in the tyrosine
kinase domains of both erbB-2 and erbB-3 genes exhibit identical splice
junctions, while these splice borders differ from those of tyrosine kinases
of other subfamilies (Fig. 1). Thus, due to the [...]
sequence and exon structure, the approach enhances the probability for
detection of gene fragments most closely related to [...]
the probe is derived. [...]
scores between exons of novel PTKs cross-hybridizing [...]
the reduction of stringency required for the detection of single-copy gene
fragments in normal DNA. This suggests that a reduction of stringency by
14° is adequate to detect related tyrosine kinase exons sharing in the range
of 61 to 80% nucleotide sequence identity with the probe. In the EGF-R
family, single gene copies of the more closely related erbB-2 were detected
at only 7° stringency reduction (Fig. 1).

Fig. 2. erbB-related sequences in the human genome. Replica Southern blots containing
EcoRI (lanes 1-3) or SacI (lanes 4-6)-cleaved normal human genomic (1, 4), MDA-MB468
(2, 5), and SK-BR-3 (3, 6) DNAs were hybridized with v-erbB at intermediate (ΔT_s = -7°)
(A) and reduced (ΔT_s = -14°) (B) or with an exon-specific erbB-3 probe at high (C) and
reduced stringencies (ΔT_s = -14°) (D). The migration positions of novel erbB-related EcoRI
and SacI restriction fragments identified under reduced stringency hybridization with v-erbB
as a probe are indicated by arrowheads in (B). [Figure not reproduced. Size markers (kbp):
23, 9.4, 6.4, 4.3, 2.3.]

A typical experimental approach is illustrated in Fig. 2. Replica nitro- 
cellulose filters are generated containing several restriction digests of nor- 
mal genomic DNA (lanes 1 and 4). The use of multiple enzyme digests 
increases the probability of detecting a novel PTK fragment distinct in size 
from those of the cognate gene. Furthermore, it is possible to exclude 
restriction fragment length polymorphisms. In this example, conducted 
for the identification of erbB-3,³ we controlled for the gene fragments
of known family members by their amplification in tumor cell DNAs. 
Alternatively, restriction fragments of known family members are identi- 
fied by high-stringency hybridization with a perfectly matched probe. 
Using v-erbB as a probe at a stringency reduced by 7° (Fig. 2A), only 
restriction fragments specific to the EGF-R or erbB-2 genes hybridized as 
determined by their amplification in MDA-MB468 (lanes 2 and 5) or SK- 
BR-3 (lanes 3 and 6), respectively. An additional 7° reduction of stringency
to ΔT_s = -14° facilitated the detection of a novel v-erbB-related gene
distinct from EGF-R and erbB-2 genes. It was most [...]
digest as a 9-kbp restriction fragment (Fig. 2B, lanes 4-6), but was also
detected as a 3.7-kbp EcoRI fragment (Fig. 2B, lanes 1-3). The specific
detection of a novel erbB-related gene was inferred by absence of
hybridization under the higher stringency conditions and the lack of gene
amplification in MDA-MB468 and SK-BR-3 cells.

Evaluation of Southern blots and subsequent genomic cloning are
facilitated by choosing restriction enzymes and probes such that single or
few bands for individual PTK genes are generated. If exon structure and
nucleotide sequence conservation are known for a subfamily of interest,
a shorter probe is preferred, as the probability of multiple hybridizing
fragments decreases for each member. Otherwise, longer probes spanning
[...] exons are preferable, since the region of relatively highest
conservation within the TK domain varies among distinct families (Fig. 1). We
have successfully utilized probes extending from a single exon (135 bp) to
most of the tyrosine kinase domain, respectively. Endonucleases cleaving
[...] sequence in the probe region should be avoided. This is
underscored by comparison of the SacI with the EcoRI digest in Fig. 2 using a
1-kbp v-erbB probe at reduced stringency. The [...]
3.7-kbp EcoRI erbB-3 fragment is obscured by [...]
erbB-2 fragments in the range of 3-8 kbp, whereas the 9-kbp [...]
specific SacI fragment is readily distinguished from two EGF-R- and one
erbB-2-specific fragments.

Characterization of Novel Tyrosine Kinase Exons

For further characterization, a novel related restriction fragment is
[...] library constructed by standard [...]⁹
bacteriophage λ. At this step it is helpful to enrich the [...]
for the novel restriction fragment (i.e., sucrose gradient [...],
DEAE membrane elution from agarose gel) with an attempt to exclude
restriction fragment sizes of the cognate genes. [...]
of interest are identified by in situ hybridization at the reduced stringency
initially determined by genomic Southern blot hybridization.

1. The bacteriophages are propagated according to [...] procedures
in 8 ml 0.7% top agarose in 150-mm plates [...]
layer at a density and plaque size that [...]
on the plates. After lifting, the filters are air dried (15 min) and treated on
3MM Whatman sheets with the absorbed bacteriophages facing upward.
Following denaturation (1 min) in 0.5 M NaOH/1.5 M NaCl the filters are
briefly drained of excess alkali on dry 3MM paper, equilibrated with 1 M
NH4OAc/0.02 M NaOH (5 min), air dried, and baked for 2-3 hr at 80° in
a vacuum oven.

⁹ T. Maniatis, E. F. Fritsch, and J. Sambrook, "Molecular Cloning: A Laboratory Manual."
Cold Spring Harbor Lab., Cold Spring Harbor, New York, 1982.

2. For an efficient transfer of colonies of plasmid-based libraries from
the plates onto nitrocellulose filters, 1.1% agar plates are used, where the
melted agar has been cooled to 45° prior to pouring the plates. The marked
filters are lifted gently, and with the colony side facing upward instantly
placed onto 3MM paper saturated with denaturing solution. Sufficient lysis
can be visually monitored, as the colonies become translucent (1-2 min).
The filters are then treated essentially as described above. To minimize
background, the filters following treatment with 1 M NH4OAc/0.02 M
NaOH are pressed while still wet between dry sheets of 3MM paper
by firmly rolling a 25-ml pipette across the sandwich. By peeling the
nitrocellulose filter off the 3MM paper, excess bacterial debris will adhere
to the 3MM paper whereas DNA remains bound to the nitrocellulose
membrane.

3. Hybridization and washing conditions are essentially as described
for genomic Southern blot hybridization. Hybridization is conducted in
150-mm Petri dishes. Radioactive and nonradioactive hybridization
solutions are prepared as described earlier and poured into separate dishes.
Filters are individually wetted in cold hybridization solution and
submerged in hybridization solution containing the probe (1-3 x 10⁶ cpm/
ml). The covered dish enclosed in a bag or β radiation-safe container is
rotated slowly (<100 rpm) in a 42° orbital shaker for overnight
hybridization. As a guideline, 30 large round filters can be hybridized in 1 dish
containing ~100 ml of radioactive hybridization solution. An additional
100 ml of nonradioactive hybridization solution is required to wet the
nitrocellulose filters.

4. Washing of the filters is carried out in 4-liter covered glass beakers
throughout. As many as 50 filters can be processed in 1 container. For
even washing, the filters must remain in motion and submerged in the
solution. A volume of 1-1.5 liters of washing solution for each beaker is
adequate. Four room-temperature washes of 15-20 min each are followed
by two final stringency washes of 30 min each. It is advisable to monitor
the actual temperature in the solution during the final stringency washes
at 50° in order to ensure correct stringencies.

5. Clones for the cognate genes can be distinguished by hybridizing
replica filters from the same plates with probes specific for those genes
under high stringency. More conveniently, 1 µl of first cycle positive phage
stocks are applied to gridded plates by piercing the surface of the top
agarose that contains susceptible bacterial cells with a pointed pipette tip.
Phage lysis will occur after a 10- to 12-hr incubation at 37°. These plates
are then used for the generation of replica nitrocellulose filters. The latter
approach for distinguishing clones of the cognate genes is particularly
useful when many first round positives require further analysis.

For the characterization of erbB-3, a normal human genomic DNA
library enriched for 8- to 12-kbp SacI restriction fragments was
constructed in bacteriophage. Among 4 x 10⁵ recombinants, 29 positives were
identified using v-erbB as a probe under reduced hybridization stringency.
One positive represented an EGF-R clone, whereas none was detected for
erbB-2. Several positive plaques are purified and the phage DNAs are
subjected to restriction and hybridization analysis at reduced stringency
in order to map the region of homology within the phage DNA insert.
Hybridizing insert fragments of suitable size (preferably 1 kbp or less)
are then subcloned in plasmid vectors with high replication rate (pUC,
Bluescript, etc.) and subjected to nucleotide sequence analysis by the
dideoxy chain termination method using supercoiled DNA as template.

Gene-specific probes from the genomic clone can be used to investigate
at high stringency mRNA expression in various sources. In addition,
genetic alterations including gene amplification or rearrangement as well as
altered expression for a novel tyrosine kinase gene can be searched for in
human disease. The predicted amino acid sequence of exons can be used
to design peptides for the generation of polyclonal antisera in efforts
to study the encoded gene product. For structural characterization and
functional studies involving in vitro expression of the protein, the
complete coding sequence can be isolated by cDNA cloning using gene-specific
exon-containing probes at high stringency. The initial expression analysis
is useful in detection of sources with relatively high transcript levels for
cDNA cloning.

Utilizing a single genomic exon probe of erbB-3 (Fig. 1, downstream
exon) we rehybridized the initial genomic Southern blot under both high
and reduced stringencies. This probe detected under high stringency the
erbB-3-specific genomic fragments initially identified by v-erbB under
reduced stringency (Fig. 2C). As expected at the same level of stringency
reduction, the erbB-3-specific exon probe recognized only the homologous
EGF-R- and erbB-2-specific gene fragments in addition to the endogenous
restriction fragments (Fig. 2D), corroborating the specificity as well as
reproducibility of this approach.

Discussion 

In this chapter, we have summarized a general approach for detecting 
novel PTK-related genes in genomic DNA. Based upon knowledge of 
structural conservation in a distinct PTK subfamily, the approach can be 
optimized for specific tasks by choice of restriction enzymes, probe
design, and genomic DNA source. Several other approaches (see Chapters
44, 45, 47 in this volume), including reduced stringency hybridization of
cDNA libraries with TK probes or degenerate oligonucleotides, have been
employed in the successful isolation of novel PTKs. More recently, the
polymerase chain reaction with degenerate oligonucleotides as primers
has been utilized to isolate novel TK-coding sequences from DNA
complementary to mRNA. Finally, screening of cDNA expression libraries
with anti-phosphotyrosine antibodies has facilitated the successful
isolation of PTKs based upon their expression at the protein level.

Approaches utilizing RNA as a source either combine criteria for
structural relationship and levels of expression of a novel PTK or are
strictly based on protein function. In the latter case this might allow
the isolation of PTKs lacking sufficient sequence conservation for
detection by the other methods described. Those techniques involving RNA
as the source have the potential advantage of providing a more rapid
determination of larger portions of coding sequence than the genomic
method. On the other hand, the search for novel PTKs in mRNA-derived
sources could be biased by the expression pattern. For example, a highly
conserved, novel PTK could be missed, if not sufficiently expressed in a
particular cell source, or inefficiently converted to cDNA due to its
structure.

Since normal genomic DNA contains a complete set of genes encoding
PTKs, the genomic approach is most comprehensive in the search for
structurally conserved genes. Moreover, titration of stringency reduction
on the genomic Southern blot can provide information about the number
of different genes detected at a given stringency. A potential
disadvantage of this method is the detection of nonfunctional genes,
although in our search for novel PTKs we have not encountered such genes.
While an intermediate genomic cloning step is required, knowledge of the
expression pattern of a novel PTK prior to cDNA cloning may also be
critical to the subsequent isolation of its complete coding sequence from
the appropriate cDNA libraries. Hence, each of the described methods
entails certain advantages, and the method of choice is determined by the
specific goal. While those methods utilizing cDNAs may be particularly
valuable in the isolation of tissue-specific expressed novel PTKs or
molecules of more divergent genetic structure, genomic identification of
novel PTKs appears most suitable for the systematic search for
structurally related PTK coding sequences.
Fig. 2. Cerebellar cortex of the same rat as Fig. 1. Unstained
autoradiogram of a cryostat section. M, Molecular layer; P, Purkinje
cells (non-reacting); G, granular layer; W, white matter.



Fig. 3. Lumbar spinal cord (anterior horn) of a rat, 45 min after a single
injection of TSC-14C. Bouin-fixed and embedded specimen, counterstained
with haematoxylin and eosin. Motoneurones (M) are non-reacting; capsular
glial cells (arrow) and astrocytes in the grey matter (arrow with a
circle) contain numerous grains.



binding by non-reacting structures might indicate those 
cellular elements that do not exert GAD activity. Such 
an approach is actually an adaptation of the principle 
described by Ostrovsky and Barnard 5 . 

Rats weighing 160 g were injected intraperitoneally
with 5-20 mg/kg thiosemicarbazide-14C (specific activity
10 mCi/mmole, obtained from the Central Isotope Institute,
Budapest). Characteristic convulsions and lethal "jumps"
appeared 45-120 min after injection. The animals were
killed by decapitation and samples of the central nervous
system (brain, medulla, cerebellum, spinal cord) as well
as other tissues were either fixed in Bouin's solution and
prepared for autoradiography in the usual way or were not
fixed but frozen with dry ice, cut on a cryostat and applied
to slides pre-coated with Kodak stripping film, emulsion
side up. The exposure time was 3-8 weeks; autoradiograms
were developed in Kodak autoradiographic developer and
some were counterstained with haematoxylin and eosin.

Fig. 1 shows a coronal section of the brain of a rat given
two intraperitoneal injections of 10 mg/kg TSC-14C, with
an interval of 40 min between them, and then killed 90 min
after the last injection. The heaviest reaction is confined to
the hippocampus and fascia dentata, chiefly in the layer of
hippocampal pyramids (Fig. 1, inset); silver grains are,
however, not found in the nerve cells themselves but are
concentrated in the surrounding neuropil. Activity is high
also in the nucleus habenularis medialis. There is a
moderate reaction in the cortex, especially in the vicinity
of the interhemispherical fissure.

Both the molecular layer and the granular layer of the
cerebellum contain numerous silver grains (Fig. 2). The
localization pattern of the reaction does not, however,
conform to any of the neural elements but resembles more
closely the glial structure. The reaction is slightly stronger
around Purkinje cells, which are themselves devoid of any
reaction both in pre-fixed and in non-fixed specimens. On
the other hand, in deep cerebellar nuclei (and also in some 
of the brain stem nuclei, for example the substantia nigra 
and nuclei pontis) silver grains are located within the 
cytoplasms of nerve cells. Weak or virtually no activity 
can be seen in the white matter, although the glial cells 
in the white matter react if higher doses of TSC are used or 
if the animals are killed after a shorter time interval. 

The nerve cells in the spinal cord do not contain silver 
grains, but capsular and other glial cells in both the grey 
and white matter contain numerous grains (Fig. 3). 
According to Curtis 2, GABA is not involved in spinal
inhibitory mechanisms.

The gross distribution of silver grains reduced by TSC-14C
is in accord with the distribution of GAD and GABA
anticipated on the basis of earlier biochemical and
pharmacological studies (cerebellar cortex 2, hippocampus 2,
substantia nigra 6). While the concentration of silver grains
in nerve cells (dentate nucleus, substantia nigra) and in the
neuropil surrounding them (hippocampus) is consistent
with current views, it is striking that Purkinje cells do not
exert any reaction. Proteins sensitive to TSC are undoubtedly
more widespread than GAD, so the apparent localization
of TSC in glial cells of the cerebellar cortex may be partly
due to binding of the drug to other B6-dependent enzymes.
The possibility cannot be excluded, however, that, at
least in this area, GABA is produced by glial cells, perhaps
in order to be taken up by nerve terminals and/or nerve
cells in a second step.
Natural Selection and the Concept
of a Protein Space

Salisbury 1 has argued that there is an apparent
contradiction between two fundamental concepts of biology:
the belief that the gene is a unique sequence of nucleotides
whose function it is to determine the sequence of amino-acids
in a protein, and the theory of evolution by natural
selection. In brief, he calculated that the number of
possible amino-acid sequences is greater by many orders of
magnitude than the number of proteins which could have
existed on Earth since the origin of life, and hence that
functionally effective proteins have a vanishingly small
chance of arising by mutation. Natural selection is
therefore ineffective because it lacks the essential raw
material: favourable mutations.

I should like to look at the problem from a different 
point of view. I shall assume that mutations, while not 
random in a chemical sense, are random as far as their 
chances of improving the function of the corresponding 
proteins are concerned. I shall also assume that evolution 
has occurred either by the natural selection of favourable 
mutations or by the chance fixation by genetic drift of 
selectively neutral mutations. The justification for making 
these assumptions is that no sensible alternatives have been 
suggested and that no evidence exists at the moment to 
invalidate them. If these assumptions are true what can 
we say about the frequency and distribution of amino- 
acid sequences which are functional, either as enzymes or 
in some other way ? 

The model of protein evolution I want to discuss is
best understood by analogy with a popular word game.
The object of the game is to pass from one word to another
of the same length by changing one letter at a time, with
the requirement that all the intermediate words are
meaningful in the same language. Thus WORD can be
converted into GENE in the minimum number of steps
as follows:

WORD WORE GORE GONE GENE

This is an analogue of evolution, in which the words
represent proteins; the letters represent amino-acids;
the alteration of a single letter corresponds to the simplest
evolutionary step, the substitution of one amino-acid for
another; and the requirement of meaning corresponds to
the requirement that each unit step in evolution should
be from one functional protein to another. The reason
for the last requirement is as follows: suppose that a
protein A B C D ... exists, and that a protein a b C D ...
would be favoured by selection if it arose. Suppose
further that the intermediates a B C D ... and A b C D ...
are non-functional. These forms would arise by
mutation, but would usually be eliminated by selection
before a second mutation could occur. The double step
from A B C D ... to a b C D ... would thus be very unlikely
to occur. Such double steps with unfavourable
intermediates may occasionally occur, but are probably too
rare to be important in evolution.
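
The word game can be sketched as a breadth-first search over a dictionary, in which only meaningful words may serve as intermediates. The sketch below is illustrative only: the function name `word_ladder` and the small vocabulary are assumptions introduced here, not part of the original argument.

```python
from collections import deque

def word_ladder(start, goal, vocabulary):
    """Return the shortest chain of words from start to goal in which each
    step changes exactly one letter and every intermediate is in the
    vocabulary. Returns None if no such chain exists."""
    words = set(vocabulary) | {start, goal}
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        for candidate in words:
            if candidate in seen or len(candidate) != len(current):
                continue
            # A unit step: the two words differ at exactly one position.
            if sum(a != b for a, b in zip(current, candidate)) == 1:
                seen.add(candidate)
                queue.append(path + [candidate])
    return None

# Hypothetical toy vocabulary of "meaningful" four-letter words.
VOCAB = {"WORD", "WORE", "GORE", "GONE", "GENE", "WARD", "CORE", "BONE"}
print(word_ladder("WORD", "GENE", VOCAB))
```

With this toy vocabulary the search recovers the chain WORD, WORE, GORE, GONE, GENE; if the intermediate words are removed from the vocabulary, no chain is found, which mirrors the case of non-functional intermediates.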

This is a model of the way in which one gene may change 
into another. An increase in the number of different
genes in a single organism presumably occurs by the 
duplication of an already existing gene followed by 
divergence. If so, it remains true that new genes arise 
as modifications of pre-existing ones. 

It follows that if evolution by natural selection is to
occur, functional proteins must form a continuous network 
which can be traversed by unit mutational steps without 
passing through nonfunctional intermediates. In this 
respect, functional proteins resemble four-letter words in 
the English language, rather than eight-letter words, for 
the latter form a series of small isolated islands in a sea of 
nonsense sequences. Of course, this is not to deny the 
existence of isolated island proteins, analogous to the 
four-letter words ALSO and ALTO. 

It is easy to state the condition which must be satisfied
if meaningful proteins are to form a network. Let X be a
meaningful protein. Let N be the number of proteins
which can be derived from X by a unit mutational step,
and f the fraction of these which are meaningful, in the
sense of being as good as or better than X in some
environment. Then, if fN > 1, meaningful proteins will form a
network, and evolution by natural selection is possible.
In estimating N it is necessary to distinguish two classes of
mutations: (i) substitutions of single amino-acids, and
additions or deletions of small numbers of amino-acids
making only a small change to the protein; and (ii)
mutations producing a major change in amino-acid
sequence, such as frame shifts and intramolecular
inversions.
Mutations of the former type are much more likely to
give rise to meaningful proteins than the latter. In the
same way, a single random letter substitution in a
meaningful word is more likely to give rise to a meaningful
word than the simultaneous alteration of all the letters.
Although frame shift mutations are known to occur,
it is not clear whether they have ever been incorporated
in evolution. It is therefore better to take N as the
number of possible substitutions of single amino-acids.
If all substitutions were possible in a single mutational
step, N for a protein of 100 amino-acids would be 1,900.
In practice the genetic code limits N to approximately 10^3.
Hence f must be greater than 1/1,000. It does not follow
that the fraction of all possible sequences which are
meaningful need be as high as 1/1,000. It is probably much
lower. There is almost certainly a higher probability
that a sequence will be meaningful if it is a neighbour of
an existing functional protein than if it is selected at
random. In fact, in treating N as the number of amino-acid
substitutions rather than as the total number of
possible mutational steps, it was in effect assumed that a
random sequence has a negligibly small probability of
being functional; this assumption will be confirmed if it
turns out that frame shifts are rarely or never incorporated
in evolution.
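
The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the value of 1,000 for the code-limited N is simply the approximate figure implied by the 1/1,000 bound in the text, not an independent calculation.

```python
AMINO_ACIDS = 20

def single_substitution_neighbours(length, alphabet=AMINO_ACIDS):
    """N when every residue could be replaced by any other residue in one
    mutational step: each of `length` positions has (alphabet - 1)
    alternatives."""
    return length * (alphabet - 1)

def minimum_meaningful_fraction(n_neighbours):
    """The bound on f implied by fN > 1, i.e. f must exceed 1/N."""
    return 1 / n_neighbours

n_all = single_substitution_neighbours(100)  # 100 positions x 19 alternatives
n_code_limited = 1000                        # approximate figure assumed from the text
print(n_all, minimum_meaningful_fraction(n_code_limited))
```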

Suppose now that we imagine all possible amino-acid
sequences to be arranged in a "protein space", so that
two sequences are neighbours if one can be converted into
another by a single amino-acid substitution. Then the
requirement that fN > 1 requires that the "density" of
functional proteins in certain regions of the space must be
high, perhaps greater than 1/1,000. This agrees
with Salisbury's conclusion that proteins, and hence the
genes that determine them, cannot be as unique as all
that. As a convinced Darwinist, I published 2 the
conclusion that fN > 1 when little was known about the
frequency of amino-acid substitutions in evolution. Since
then evidence has accumulated (for a review, see King and
Jukes*) that many substitutions are either selectively
neutral or at least make comparatively minor changes
in the function of proteins.

If fN > 1, no quantitative difficulty arises in explaining
the evolution of proteins by natural selection. A difficulty
nevertheless remains in explaining the origin of life, that
is, in explaining the origin of the first functional proteins
together with the genetic mechanism for producing them.
If it were true that only a minute fraction of possible
amino-acid sequences have even the slightest enzymatic
activity, it would be difficult to understand how the first
proteins arose. I do not want to discuss the problem of
the origin of life, but only to point out that it is a quite
different problem from that of the mechanism of evolution.

Some questions about molecular evolution can be
formulated more clearly in terms of a protein space.
For example: (i) Are all existing proteins part of the same
continuous network, and if so, have they all been reached
from a single starting point? Possible alternatives are
that there are two or more distinct networks, or that
there is one network with multiple starting points.
(ii) How often, if ever, has evolution passed through a
nonfunctional sequence? If so, has this been achieved
by the random walk of genes rendered redundant by
duplication, or by the chance concurrence of two or more
mutations? (iii) What fraction of the functional network
has already been explored in evolution? (iv) What
fraction of potentially useful proteins are inaccessible?
Deciphering the Message in Protein Sequences:
Tolerance to Amino Acid Substitutions

James U. Bowie, John F. Reidhaar-Olson, Wendell A. Lim,
Robert T. Sauer

An amino acid sequence encodes a message that determines the shape
and function of a protein. This message is highly degenerate in that
many different sequences can code for proteins with essentially the same
structure and activity. Comparison of different sequences with similar
messages can reveal key features of the code and improve understanding
of how a protein folds and how it performs its function.

The genome is manifest largely in the set of proteins that it encodes.
It is the ability of these proteins to fold into unique three-dimensional
structures that allows them to function and carry out the instructions of
the genome. Thus, comprehending the rules that relate amino acid sequence
to structure is fundamental to an understanding of biological processes.
Because an amino acid sequence contains all of the information
necessary to determine the structure of a protein (1), it should be
possible to predict structure from sequence, and subsequently to
infer detailed aspects of function from the structure. However, both
problems are extremely complex, and it seems unlikely that either
will be solved in an exact manner in the near future. It may be
possible to obtain approximate solutions by using experimental data
to simplify the problem. In this article, we describe how an analysis
of allowed amino acid substitutions in proteins can be used to
reduce the complexity of sequences and reveal important aspects of
structure and function.



Methods for Studying Tolerance to
Sequence Variation

There are two main approaches to studying the tolerance of an
amino acid sequence to change. The first method relies on the
process of evolution, in which mutations are either accepted or
rejected by natural selection. This method has been extremely
powerful for proteins such as the globins or cytochromes, for which
sequences from many different species are known (2-7). The second
approach uses genetic methods to introduce amino acid changes at
specific positions in a cloned gene and uses selections or screens to
identify functional sequences. This approach has been used to great
advantage for proteins that can be expressed in bacteria or yeast,
where the appropriate genetic manipulations are possible (3, 8-11).
The end results of both methods are lists of active sequences that can
be compared and analyzed to identify sequence features that are
essential for folding or function. If a particular property of a side
chain, such as charge or size, is important at a given position, only
side chains that have the required property will be allowed.
Conversely, if the chemical identity of the side chain is unimportant,
then many different substitutions will be permitted.

Studies in which these methods were used have revealed that
proteins are surprisingly tolerant of amino acid substitutions (2-4,
11). For example, in studying the effects of approximately 1500
single amino acid substitutions at 142 positions in lac repressor,
Miller and co-workers found that about one-half of all substitutions
were phenotypically silent (11). At some positions, many different,
nonconservative substitutions were allowed. Such residue positions
play little or no role in structure and function. At other positions, no
substitutions or only conservative substitutions were allowed. These
residues are the most important for lac repressor activity.

What roles do invariant and conserved side chains play in
proteins? Residues that are directly involved in protein functions
such as binding or catalysis will certainly be among the most
conserved. For example, replacing the Asp in the catalytic triad of
trypsin with Asn results in a 10^4-fold reduction in activity (12). A
similar loss of activity occurs in λ repressor when a DNA binding
residue is changed from Asn to Asp (13). To carry out their
function, however, these catalytic residues and binding residues
must be precisely oriented in three dimensions. Consequently,
mutations in residues that are required for structure formation or
stability can also have dramatic effects on activity (10, 14-16).
Hence, many of the residues that are conserved in sets of related
sequences play structural roles.



Substitutions at Surface and Buried Positions

In their initial comparisons of the globin sequences, Perutz and
co-workers found that most buried residues require nonpolar side
chains, whereas few features of surface side chains are generally
conserved (6). Similar results have been seen for a number of protein
families (2, 4, 5, 7, 17, 18). An example of the sequence tolerance at
surface versus buried sites can be seen in Fig. 1, which shows the
allowed substitutions in λ repressor at residue positions that are near
the dimer interface but distant from the DNA binding surface of the
protein (9). These substitutions were identified by a functional
Fig. 1. (A) Amino acid substitutions allowed in a short region of λ
repressor. The wild-type sequence is shown along the center line. The
allowed substitutions shown above each position were identified by
randomly mutating one to three codons at a time by using a cassette
method and applying a functional selection (9). (B) The fractional
solvent accessibility (42) of the wild-type side chain in the protein
dimer (43) relative to the same atoms in an Ala-X-Ala model tripeptide.
selection after cassette mutagenesis. A histogram of side chain
solvent accessibility in the crystal structure of the dimer is also 
shown in Fig. 1. At six positions, only the wild-type residue or 
relatively conservative substitutions are allowed. Five of these 
positions are buried in the protein. In contrast, most of the highly 
exposed positions tolerate a wide range of chemically different side 
chains, including hydrophilic and hydrophobic residues. Hence, it 
seems that most of the structural information in this region of the 
protein is carried by the residues that are solvent inaccessible. 



Constraints on Core Sequences 

Because core residue positions appear to be extremely important 
for protein folding or stability, we must understand the factors that 
dictate whether a given core sequence will be acceptable. In general, 
only hydrophobic or neutral residues are tolerated at buried sites in
proteins, undoubtedly because of the large favorable contribution of
the hydrophobic effect to protein stability (19). For example, Fig. 2
shows the results of genetic studies used to investigate the
substitutions allowed at residue positions that form the hydrophobic
core of the NH2-terminal domain of λ repressor (20). The acceptable core
sequences are composed almost exclusively of Ala, Cys, Thr, Val, Ile,
Leu, Met, and Phe. The acceptability of many different residues at
each core position presumably reflects the fact that the hydrophobic 
effect, unlike hydrogen bonding, does not depend on specific 
residue pairings. Although it is possible to imagine a hypothetical 
core structure that is stabilized exclusively by residues forming 
hydrogen bonds and salt bridges, such a core would probably be 
difficult to construct because hydrogen bonds require pairing of 
donors and acceptors in an exact geometry. Thus the repertoire of 
possible structures that use a polar core would probably be extreme- 
ly limited (21). Polar and charged residues are occasionally found in 
the cores of proteins, but only at positions where their hydrogen 
bonding needs can be satisfied (22). 

The cores of most proteins are quite closely packed (23), but some
volume changes are acceptable. In λ repressor, the overall core
volume of acceptable sequences can vary by about 10%. Changes at
individual sites, however, can be considerably larger. For example,
as shown in Fig. 2, both Phe and Ala are allowed at the same core
position in the appropriate sequence contexts. Large volume
changes at individual buried sites have also been observed in
phylogenetic studies, where it has been noted that the size decreases
and increases at interacting residues are not necessarily related in a
simple complementary fashion (5, 7, 17). Rather, local volume
changes are accommodated by conformational changes in nearby
side chains and by a variety of backbone movements.



The Informational Importance of the Core 

With occasional exceptions, the core must remain hydrophobic 
and maintain a reasonable packing density. However, since the core 
is composed of side chains that can assume only a limited number of 
conformations (24), efficient packing must be maintained without 
steric clashes. How important are hydrophobicity, volume, and 
steric complementarity in determining whether a given sequence can 
form an acceptable core? Each factor is essential in a physical sense, 
as a stable core is probably unable to tolerate unsatisfied hydrogen 
bonding groups, large holes, or steric overlaps (25). However, in an
informational sense, these factors are not equivalent. For example, in
experiments in which three core residues of λ repressor were
mutated simultaneously, volume was a relatively unimportant
informational constraint because three-quarters of all possible
combinations of the 20 naturally occurring amino acids had volumes within
the range tolerated in the core, and yet most of these sequences were
unacceptable (20). In contrast, of the sequences that contained only


Fig. 2. Amino acid substitutions allowed in the core of λ repressor. The
wild-type side chains are shown pictorially in the approximate
orientation seen in the crystal structure (43). The lists of allowed
substitutions at each position are shown below the wild-type side chains.
These substitutions were identified by randomly mutating one to four
residues at a time by using a cassette method and applying a functional
selection (20). Not all substitutions are allowed in every sequence
background.
the appropriate hydrophobic residues, a significant fraction were
acceptable. Hence, the hydrophobicity of a sequence contains
more information about its potential acceptability in the core than
does the total side chain volume. Steric compatibility was
intermediate between volume and hydrophobicity in informational
importance.



The Informational Importance of Surface Sites 

We have noted that many surface sites can tolerate a wide variety 
of side chains, including hydrophilic and hydrophobic residues. This 
result might be taken to indicate that surface positions contain little
structural information. However, Bashford et al., in an extensive
analysis of globin sequences (4), found a strong bias against large 
hydrophobic residues at many surface positions. At one level, this 
may reflect constraints imposed by protein solubility, because large 
patches of hydrophobic surface residues would presumably lead to 
aggregation. At a more fundamental level, protein folding requires a 
partitioning between surface and buried positions. Consequently, to 
achieve a unique native state without significant competition from 
other conformations, it may be important that some sites have a 
decided preference for exterior rather than interior positions. As a 
result, many surface sites can accept hydrophobic residues individ- 
ually, but the surface as a whole can probably tolerate only a 
moderate number of hydrophobic side chains. 

Identification of Residue Roles from 
Sets of Sequences 

Often, a protein of interest is a member of a family of related 
sequences. What can we infer from the pattern of allowed substitu- 
tions at positions in sets of aligned sequences generated by genetic 
or phylogenetic methods? Residue positions that can accept a 
number of different side chains, including charged and highly polar 
residues, are almost certain to be on the protein surface. Residue 
positions that remain hydrophobic, whether variable or not, are 
likely to be buried within the structure. In Fig. 3, those residue
positions in λ repressor that can accept hydrophilic side chains are
shown in orange and those that cannot accept hydrophilic side 
chains are shown in green. The obligate hydrophobic positions 
define the core of the structure, whereas positions that can accept 
hydrophilic side chains define the surface. 
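
The inference described above can be sketched as a simple classification of alignment columns. This is a minimal illustration under stated assumptions: the hydrophobic set follows the residues listed for the λ repressor core (Ala, Cys, Thr, Val, Ile, Leu, Met, Phe), the hydrophilic set is an assumed stand-in for "charged and highly polar" residues, and the sequences are invented for the example.

```python
# One-letter codes for the core residues named in the text.
HYDROPHOBIC = set("ACTVILMF")
# Assumption: residues treated here as charged or highly polar.
HYDROPHILIC = set("DEKRNQH")

def classify_positions(aligned_sequences):
    """Label each alignment column as 'surface', 'buried', or 'unclear'
    from the set of side chains observed at that position."""
    labels = []
    for column in zip(*aligned_sequences):
        observed = set(column)
        if observed & HYDROPHILIC:
            # The position accepts at least one hydrophilic side chain.
            labels.append("surface")
        elif observed <= HYDROPHOBIC:
            # The position stays hydrophobic, whether variable or not.
            labels.append("buried")
        else:
            labels.append("unclear")
    return labels

# Hypothetical aligned set of active sequences.
active = ["LKAEVG", "IRADVS", "LEVKVG"]
print(classify_positions(active))
```

For the invented sequences this labels the columns buried, surface, buried, surface, buried, unclear. Real sequence sets would need far more members before such labels could be trusted.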

Functionally important residues should be conserved in sets of 
active sequences, but it is not possible to decide whether a side chain 
is functionally or structurally important just because it is invariant or 
conserved. To make this distinction requires an independent assay of 
protein folding. The ability of a mutant protein to maintain a stably 
folded structure can often be measured by biophysical techniques, 
by susceptibility to intracellular proteolysis (25), or by binding to 
antibodies specific for the native structure (27, 28). In the latter 
cases, it is possible to screen proteins in mutated clones for the 
ability to fold even if these proteins are inactive. Sets of sequences 
that allow formation of a stable structure can then be compared to 
the sets that allow both folding and function, with the active site or 
binding residues being those that are variable in the set of stable 
proteins but invariant in the set of functional proteins. The
DNA-binding residues of Arc repressor were identified by this method (8).
The receptor-binding residues of human growth hormone were also
identified by comparing the stabilities and activities of a set of
mutant sequences (28). However, in this case, the mutants were
generated as hybrid sequences between growth hormone and related
hormones with different binding specificities.
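
The comparison of the two sequence sets can be sketched as follows. The function names and the example sequences are hypothetical; the sketch only illustrates the rule that candidate active site or binding positions are invariant among functional sequences yet variable among sequences that merely fold.

```python
def residues_by_position(sequences):
    """Set of residues observed at each aligned position."""
    return [set(column) for column in zip(*sequences)]

def candidate_functional_positions(folded, functional):
    """Zero-based positions that vary in the stably folded set but are
    invariant in the functional set."""
    folded_sets = residues_by_position(folded)
    functional_sets = residues_by_position(functional)
    return [
        index
        for index, (in_folded, in_functional)
        in enumerate(zip(folded_sets, functional_sets))
        if len(in_functional) == 1 and len(in_folded) > 1
    ]

# Hypothetical data: every functional sequence also folds; two further
# sequences fold stably but are inactive.
functional = ["MKRLA", "MKRIA", "MQRLA"]
folded = functional + ["MKELA", "MKALV"]
print(candidate_functional_positions(folded, functional))
```

In this invented case positions 2 and 4 are flagged, while position 0 is not, because an invariant residue that is also required for folding cannot be assigned a functional role by this comparison alone.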




Implications for Structure Prediction 

At present, the only reliable method for predicting a low- 
resolution tertiary structure of a new protein is by identifying 
sequence similarity to a protein whose structure is already known 
(29, 30). However, it is often difficult to align sequences as the level 
of sequence similarity decreases, and it is sometimes impossible to 
detect statistically significant sequence similarity between distantly
related proteins. Because the number of known sequences is far
greater than the number of known structures, it would be
advantageous to increase the reach of the available structural
information by improving methods for detecting distant sequence
relations and for subsequently aligning these sequences based on
structural principles. In a normal homology search, the sequence
database is scanned with a single test sequence, and every residue must
be weighted equally. However, some residues are more important than
others and should be weighted accordingly. Moreover, certain regions of
the protein are more likely to contain gaps than others. Both kinds of
information can be obtained from sequence sets, and several techniques
have



Fig. 3. Tolerance of positions in the NH2-terminal domain of λ repressor to
hydrophilic side chains. The complex (43) of the repressor dimer (blue) and
operator DNA (white) is shown. In (A), positions that can tolerate
hydrophilic side chains are shown in orange. The same side chains are shown
in (B) without the remaining protein atoms. In (C), positions that require
hydrophobic or neutral side chains are shown in green. These side chains are
shown in (D) without the remaining protein atoms. About three-fourths of
the 92 side chains in the NH2-terminal domain are included in both (B) and
(D). The remaining positions have not been tested. Data are from (9, 14, 20,
27, 44).
been used to combine such information into more appropriately
weighted sequence searches and alignments (31). These methods
were used to align the sequences of retroviral proteases with aspartic 
proteases, which in turn allowed construction of a three-dimension- 
al model for the protease of human immunodeficiency virus type 1 
(29). Comparison with the recently determined crystal structure of 
this protein revealed reasonable agreement in many areas of the 
predicted structure (32). 

The structural information at most surface sites is highly degener- 
ate. Except for functionally important residues, exterior positions 
seem to be important chiefly in maintaining a reasonably polar 
surface. The information contained in buried residues is also 
degenerate, the main requirement being that these residues remain 
hydrophobic. Thus, at its most basic level, the key structural 
message in an amino acid sequence may reside in its specific pattern 
of hydrophobic and hydrophilic residues. This is meant in an 
informational sense. Clearly, the precise structure and stability of a 
protein depends on a large number of detailed interactions. It is 
possible, however, that structural prediction at a more primitive 
level can be accomplished by concentrating on the most basic 
informational aspects of an amino acid sequence. For example, 
amphipathic patterns can be extracted from aligned sets of sequences 
and used, in some cases, to identify secondary structures. 

If a region of secondary structure is packed against the hydropho- 
bic core, a pattern of hydrophobic residues reflecting the periodicity 
of the secondary structure is expected (33, 34). These patterns can be 
obscured in individual sequences by hydrophobic residues on the 
protein surface. It is rare, however, for a surface position to remain 
hydrophobic over the course of evolution. Consequently, the am- 
phipathic patterns expected for simple secondary structures can be 
much clearer in a set of related sequences (6). This principle is 
illustrated in Fig. 4, which shows helical hydrophobic moment plots 
for the Antennapedia homeodomain sequence (Fig. 4A) and for a 
composite sequence derived from a set of homologous homeodo- 
main proteins (Fig. 4B) (35). The hydrophobic moment is a simple 
measure of the degree of amphipathic character of a sequence in a 
given secondary structure (34). The amphipathic character of the 
three α-helical regions in the Antennapedia protein (36) is clearly
revealed only by the analysis of the combined set of homeodomain
sequences. The secondary structure of Arc repressor, a small DNA- 
binding protein, was recently predicted by a similar method (8) and 
confirmed by nuclear magnetic resonance studies (37). 
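
As a rough illustration of the hydrophobic moment described above, the following Python sketch computes a helical moment over a sliding window. It assumes the three-level hydrophobicity grouping (+1, 0, -1), the eight-residue window, and the 100° angular step given in the legend to Fig. 4; the function names are hypothetical, and the composite-sequence step used for the aligned homeodomain set is not reproduced here.

```python
import math

# Three-level grouping as listed in the Fig. 4 legend (one-letter codes).
H1 = set("WIFLMVC")   # high hydrophobicity   -> +1
H2 = set("YPATHGS")   # medium hydrophobicity ->  0
H3 = set("QNEDKR")    # low hydrophobicity    -> -1


def level(residue):
    """Map a residue to +1, 0, or -1 according to its hydrophobicity group."""
    if residue in H1:
        return 1
    if residue in H2:
        return 0
    if residue in H3:
        return -1
    raise ValueError(f"unknown residue: {residue!r}")


def helical_moment(window, angle_deg=100.0):
    """Magnitude of the vector sum of per-residue hydrophobicity values,
    with successive residues rotated by angle_deg around the helix axis."""
    x = y = 0.0
    for i, residue in enumerate(window):
        theta = math.radians(i * angle_deg)
        h = level(residue)
        x += h * math.cos(theta)
        y += h * math.sin(theta)
    return math.hypot(x, y)


def moment_profile(sequence, window=8, angle_deg=100.0):
    """Helical moment for every full window along a single sequence."""
    return [
        helical_moment(sequence[i:i + window], angle_deg)
        for i in range(len(sequence) - window + 1)
    ]
```

A window with hydrophobic residues clustered on one face of the helix gives a large moment, whereas a uniformly hydrophobic or uniformly neutral window gives a small one.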

The specific pattern of hydrophobic and hydrophilic residues in 
an amino acid sequence must limit the number of different structures 
a given sequence can adopt and may indeed define its overall fold. If 
this is true, then the arrangement of hydrophobic and hydrophilic 
residues should be a characteristic feature of a particular fold. Sweet 
and Eisenberg have shown that the correlation of the pattern of 
hydrophobicity between two protein sequences is a good criterion
for their structural relatedness (38). In addition, several studies 
indicate that patterns of obligatory hydrophobic positions identified 
from aligned sequences are distinctive features of sequences that 
adopt the same structure (4, 29, 38, 39). Thus, the order of
hydrophobic and hydrophilic residues in a sequence may actually be 
sufficient information to determine the basic folding pattern of a 
protein sequence. 

Although the pattern of sequence hydrophobicity may be a 
characteristic feature of a particular fold, it is not yet clear how such 
patterns could be used for prediction of structure de novo. It is 
important to understand how patterns in sequence space can be 
related to structures in conformation space. Lau and Dill have 
approached this problem by studying the properties of simple 
sequences composed only of H (hydrophobic) and P (polar) groups 
on two-dimensional lattices (40). An example of such a representation
is shown in Fig. 5. Residues adjacent in the sequence must
occupy adjacent squares on the lattice, and two residues cannot 
occupy the same space. Free energies of particular conformations are 
evaluated with a single term, an attraction of H groups. By 
considering chains of ten residues, an exhaustive conformational 
search for all 1024 possible sequences of H and P residues was 
possible. For longer sequences only a representative fraction of the 
allowed sequence or conformation space could be explored. The 
significant results were as follows: (i) not all sequences can fold into
a "native" structure and only a few sequences form a unique native
structure; (ii) the probability that a sequence will adopt a unique 
native structure increases with chain length; and (iii) the native 
states are compact, contain a hydrophobic core surrounded by polar 
residues, and contain significant secondary structure. Although the 
gap between these two-dimensional simulations and three-dimen- 
sional structures is large, the use of simple rules and sequence 
representations yields results similar to those expected for real 
proteins. Three-dimensional lattice methods are also beginning to 
be developed and evaluated (41). 
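
The lattice calculation summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of reference (40): it assumes an energy of -1 for each pair of H residues that occupy neighbouring lattice sites without being adjacent in the chain, and it counts conformations once per rotation or reflection class. All function names are hypothetical.

```python
from itertools import product

MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Self-avoiding walks of n residues (n >= 2) on the square lattice.
    The first step is fixed and the first turn must go 'up', so each
    walk is generated once per rotation/reflection class."""
    results = []

    def extend(path, turned):
        if len(path) == n:
            results.append(tuple(path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and dy < 0:
                continue  # mirror image of a walk whose first turn is up
            nxt = (x + dx, y + dy)
            if nxt in path:
                continue  # two residues cannot occupy the same site
            extend(path + [nxt], turned or dy != 0)

    extend([(0, 0), (1, 0)], False)
    return results


def contacts(conf):
    """Index pairs of residues on neighbouring sites, not chain neighbours."""
    index = {site: i for i, site in enumerate(conf)}
    pairs = []
    for i, (x, y) in enumerate(conf):
        for dx, dy in MOVES[:2]:  # look right and up so each pair is seen once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1:
                pairs.append((min(i, j), max(i, j)))
    return pairs


def ground_state(sequence, contact_lists=None):
    """Return (lowest energy, number of conformations reaching it)."""
    if contact_lists is None:
        contact_lists = [contacts(c) for c in conformations(len(sequence))]
    energies = [
        -sum(1 for i, j in pairs if sequence[i] == "H" and sequence[j] == "H")
        for pairs in contact_lists
    ]
    best = min(energies)
    return best, energies.count(best)


def count_unique_folders(n):
    """Number of H/P sequences of length n with a single lowest-energy walk."""
    contact_lists = [contacts(c) for c in conformations(n)]
    return sum(
        1
        for letters in product("HP", repeat=n)
        if ground_state(letters, contact_lists)[1] == 1
    )
```

Calling `count_unique_folders(10)` searches all 1024 sequences of ten residues exhaustively, in the spirit of the search described in the text; the count obtained depends on the energy and symmetry conventions assumed here.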



Summary 

There is more information in a set of related sequences than in a
single sequence. A number of practical applications arise from an 
analysis of the tolerance of residue positions to change. First, such 
information permits the evaluation of a residue's importance to the
function and stability of a protein. This ability to identify the 
essential elements of a protein sequence may improve our under- 
standing of the determinants of protein folding and stability as well 
as protein function. Second, patterns of tolerance to amino acid 
substitutions of varying hydrophilicity can help to identify residues 
likely to be buried in a protein structure and those likely to occupy 



Fig. 4. Helical hydrophobic moments calculated by using (A) the
Antennapedia homeodomain sequence or (B) a set of 39 aligned homeodomain
sequences (35). The bars indicate the extent of the helical regions
identified in nuclear magnetic resonance studies of the Antennapedia
homeodomain (36). To determine hydrophobic moments, residues were assigned
to one of three groups: H1 (high hydrophobicity = Trp, Ile, Phe, Leu, Met,
Val, or Cys); H2 (medium hydrophobicity = Tyr, Pro, Ala, Thr, His, Gly, or
Ser); and H3 (low hydrophobicity = Gln, Asn, Glu, Asp, Lys, or Arg). For
the aligned homeodomain sequences, the residues at each position were
sorted by their hydrophobicity by using the scale of Fauchere and Pliska
(45). Arg and Lys were not counted unless no other residue was found at the
position, because they contain long aliphatic side chains and can thereby
substitute for nonpolar residues at some buried sites. To account for
possible sequence errors and rare exceptions, the most hydrophilic residue
allowed at each position was discarded unless it was observed twice. The
second most hydrophilic residue was then chosen to represent the
hydrophobicity of each position. An eight-residue window was used and the
vectors projected radially every 100°. The vector magnitudes were assigned
a value of 1, 0, or -1 for positions where the hydrophobicity group was H1,
H2, or H3, respectively.

Fig. 5. A representation of one compact conformation for a particular
sequence of H and P residues on a two-dimensional square lattice.
[Adapted from (40), with permission of the American Chemical Society]



surface positions. The amphipathic patterns that emerge can be used 
to identify probable regions of secondary structure. Third, incorpo- 
rating a knowledge of allowed substitutions can improve the ability 
to detect and align distantly related proteins because the essential 
residues can be given prominence in the alignment scoring. 

As more sequences are determined, it becomes increasingly likely 
that a protein of interest is a member of a family of related 
sequences. If this is not the case, it is now possible to use genetic 
methods to generate lists of allowed amino acid substitutions. 
Consequently, at least in the short term, it may not be necessary to
solve the folding problem for individual protein sequences. Instead, 
information from sequence sets could be used. Perhaps by simplify- 
ing sequence space through the identification of key residues, and by 
simplifying conformation space as in the lattice methods, it will be 
possible to develop algorithms to generate a limited number of trial 
structures. These trial structures could then, in turn, be evaluated by 
further experiments and more sophisticated energy calculations. 
Table 1

Efficiency of suppression and specificity of insertion of tRNA suppressors

Suppressor      Gene        Amino acid inserted     Efficiency of suppression
                                                    at UAG codons (%)

Su1             supD        Serine                  6-54
Su2-89          supE        Glutamine               32-60
Su3             supF        Tyrosine                11-100
Su5             supG        Lysine                  0.6-3; 5-29
Su6             supP        Leucine                 30-100
tRNA(Ala)CUA    synthetic   Alanine                 8-83
tRNA(Cys)CUA    synthetic   Cysteine                17-54
tRNA(Glu)CUA    synthetic   Glutamic acid (80%),    8-100
                            Glutamine (20%)
tRNA(Gly)CUA    synthetic   Glycine                 24-100
tRNA(His)CUA    synthetic   Histidine               16-100
tRNA(Lys)CUA    synthetic   Lysine                  [illegible]-29
tRNA(Phe)CUA    synthetic   Phenylalanine           48-100
tRNA(Pro)CUA    synthetic   Proline                 9-[illegible]
FTOR1Δ26        synthetic   Arginine                4-28; 4-47

Data have been compiled from previous work on naturally
occurring (Miller & Albertini, 1983) and synthetic (Kleina et al.,
1990) suppressors. The efficiency of suppression is expressed as a
range of the highest and lowest values obtained in different
contexts. The Su2-89 suppressor is from Bradley et al. (1981), and
the FTOR1Δ26 suppressor from McClain & Foss (1988).



amino acids at positions in a protein corresponding 
to a UAG site (Normanly et al., 1990; Kleina et al., 
1990; McClain & Foss, 1988), and an additional 
amino acid at UGA sites. Table 1 depicts the amino 
acids which can be added by suppression of 
nonsense mutations at reasonable efficiencies. 

The E. coli lac repressor is a tetrameric protein
composed of four identical subunits, each consisting
of 360 amino acids (Mueller-Hill, 1975; Farabaugh,
1978). The repressor consists of several domains.
The amino-terminal 59 amino acids contain the
DNA and operator binding sites, and the remaining
portion (60 to 329) of the molecule contains the
binding site for inducer and the dimer association
sites (Mueller-Hill, 1975; Platt et al., 1973; Ogata &
Gilbert, 1978; Schmitz et al., 1976; Miller &
Schmeissner, 1979). The carboxyl-terminal 30-31
residues (330 to 360) are not required for the formation
of active dimers, but are required for the dimer
to tetramer transition (Platt et al., 1970; Alberti et
al., 1991), and appear to contain a leucine zipper
(Alberti et al., 1991; Chakerian et al., 1991).
Numerous investigations have identified several
classes of altered lac repressors (Mueller-Hill, 1975;
Schmitz et al., 1976; Miller & Schmeissner, 1979;
Chamness & Willson, 1970; Miller, 1978) defective in
one of these functions. Table 2 displays some of the
phenotypes resulting from different lacI mutations.

We previously described the use of nonsense
suppressors on a set of 141 nonsense mutations in
the lacI gene (Kleina & Miller, 1990), which encodes
the 360 amino acid lac repressor monomer. This
work resulted in the generation of close to 1600
amino acid replacements. Using site-directed mutagenesis,
we have constructed an additional 188



Table 2

Phenotypes resulting from different lacI mutations

Altered repressor                  Phenotypic    Mutation dominant (d)
function/property                  symbol        or recessive (r)

DNA binding                        I-            d
Folding                            I-            r
Aggregation                        I-            r
Inducer binding                    I^s           d
Allosteric transition              I^s           d
Tight binding to DNA               I^s, I^r
Reversed allosteric transition
Stability increased by             I^r
  inducer binding


Some of the altered repressors which have been characterized
are shown. The I- designation is for different defective repressors
which can no longer block transcription of the lac genes. The I^s
symbol denotes repressors that bind to operator but are not
induced by IPTG. Several types of altered repressors display this
characteristic, including those that bind operator more tightly.
This latter class also displays a partial reverse induction profile
(in which repression increases with increasing IPTG
concentration), which is designated as I^r (Chamness & Willson,
1970). I^r or I^rc (reverse curve; Myers & Sadler, 1971) repressors
may also result from altered allosteric transition or from a
stabilization of repressor by inducer binding. Temperature-sensitive
derivatives of each type of repressor can also occur. For
a more detailed description, see Kleina & Miller (1990),
Mueller-Hill (1975) or Miller (1978).



amber sites in the lacI gene. Here, we describe the
effects of suppressing all of these mutations, spanning
residues 2 to 329, with each of the characterized
amber suppressors. (Amino-terminal fragments
containing > 330 residues are partially or fully
active in vivo and were not studied.) The replacement
of 12-13 amino acids at each of the 328 amber
sites results in over 4000 altered lac repressors, and
allows a virtual genetic image reconstruction of the
functional regions of the protein. We compare the
results reported here with those of other large
collections of altered proteins, including the
pioneering studies of Perutz and co-workers on the
hemoglobins and myoglobins (Perutz, 1965; Perutz
et al., 1965; Perutz & Lehmann, 1968), studies of
HIV-1 protease (Loeb et al., 1989), an N-terminal
fragment of the lambda repressor (Bowie et al.,
1990), and T4 lysozyme (Rennell et al., 1991).
Additionally, we compare the substitutional tolerance
of individual sites in the repressor with the
evolutionary variability of those sites in homologous
proteins.



2. Materials and Methods 

(a) Genetic methodology 

All genetic methods and assays were carried out as
described by Miller (1972) and Kleina & Miller (1990).
Mutants were constructed as described previously (Kleina
& Miller, 1990), except that in many cases we crossed
mutations directly from f1 onto the F'lacpro episome,
without first cloning them onto a plasmid.
(b) Correlation of tolerance and conservation

The correlation coefficient, C (Schulz & Schirmer,
1979), is defined as:

\[ C = \frac{(w)(x) - (y)(z)}{\sqrt{(x+z)(x+y)(w+y)(w+z)}} \]

where, as a percentage or fraction, w is positive correct
prediction, x is negative correct prediction, y is
underprediction, z is overprediction, and where
w + x + y + z = 1. Here, w is conserved, intolerant; x
is not conserved, tolerant; y is predicted conserved, but
not, i.e. non-conserved, intolerant; z is conserved, tolerant:

w = 97/328 = 0.30; x = 163/328 = 0.50;
y = 37/328 = 0.11; z = 31/328 = 0.09.

\[ C = \frac{0.15 - 0.0099}{\sqrt{(0.61)(0.59)(0.41)(0.39)}} \]

C = 0.58, which is highly significant (Matthews, 1975).
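
The arithmetic in this section can be checked with a short Python sketch. The function name is hypothetical; the formula is the four-term correlation coefficient defined above, and the inputs are the site counts and rounded fractions quoted in the text.

```python
import math


def correlation_coefficient(w, x, y, z):
    """Correlation coefficient from the four prediction classes.

    w: conserved and intolerant      (positive correct prediction)
    x: not conserved and tolerant    (negative correct prediction)
    y: not conserved but intolerant  (underprediction)
    z: conserved but tolerant        (overprediction)
    Counts or fractions may be passed; they are normalized to sum to 1.
    """
    total = w + x + y + z
    w, x, y, z = (value / total for value in (w, x, y, z))
    denominator = math.sqrt((x + z) * (x + y) * (w + y) * (w + z))
    return (w * x - y * z) / denominator


# Rounded fractions, as used in the worked calculation in the text.
c_rounded = correlation_coefficient(0.30, 0.50, 0.11, 0.09)

# Raw site counts out of 328, without intermediate rounding.
c_counts = correlation_coefficient(97, 163, 37, 31)
```

With the rounded fractions the function returns about 0.58, in line with the value quoted; passing the unrounded counts gives a slightly lower figure, so the second decimal place is sensitive to rounding.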

(c) Scoring for + and - among altered repressors

In order for a replacement to score as +, it must result
in a repressor with greater than 8 to 10% activity.
Although the version of the Su2 glutamine-inserting
suppressor (Su2-89) that we are using operates with high
efficiency, the normal Su2 operates at a very low efficiency
at certain amber sites. When coupled with the
amber site corresponding to residue 117, glutamine is
inserted at only 0.8 to 1.0% efficiency (Miller & Albertini,
1983; J. H. Miller, unpublished results). Since the
repressor is overproduced 10-fold (Mueller-Hill et al.,
1968) in the strain employed, that means that glutamine
is being inserted at 8 to 10% of the level of a typical
wild-type strain. Even though glutamine is the wild-type
amino acid at position 117, repression is not fully restored
with the normal Su2 suppressor. Therefore, replacements
must result in greater than 8 to 10% activity in order to
be scored as a full +.

3. Results 

(a) Effects of amino acid replacements 

Figure 1 compiles all of the accumulated data for
the 13 × 328 amino acid replacement matrix in the
lac repressor. This Figure shows the fractional tolerance
to substitution for each residue in the protein,
without considering each substitution in detail.
More quantitative data on the effects of specific
replacements are given by Kleina et al. (1990), and
additional quantitative data will be published elsewhere.
On examining the linear array of sites in the
protein, several features strike the eye. It is clear
that large portions of the protein (60%) are tolerant
to substitutions. However, two subregions sensitive
to replacements are evident. The first, consisting of
the amino-terminal 59 residues, has previously been
shown to be involved in operator and DNA binding
(Mueller-Hill, 1975; Platt et al., 1973; Ogata &
Gilbert, 1978), and contains the helix-turn-helix
DNA binding motif seen in many regulatory proteins
(McKay & Steitz, 1981; Kaptein et al., 1985).
Analysis of the properties of sites in this region
shows that the sensitivity to amino acid exchanges
can be correlated with the three-dimensional structure
of this region as determined by NMR (Kaptein
et al., 1985), and depicted in Figure 2. Three helices
are defined in the first 51 amino acids. Helix II is
the recognition helix, including amino acids 16 to 23
(Lehming et al., 1987, 1988). It can be seen that all
amino acids in this helix are extremely sensitive to
substitutions. It is particularly striking that for
helix III, the residues whose side-chains point
inwards making internal contacts (e.g. Val38,
Ala41, Met42 and Tyr47) are sensitive to replacements,
while those facing the exterior (e.g. Glu36,
Gln39, Ala43, Glu44 and Asn46) are very tolerant of
substitutions. A similar pattern is seen for helix I,
with Val9, Ala10 and Ala13 intolerant, while other
residues pointing outward are more substitutable.
Thr5 is highly intolerant, but is believed to be in a
close non-specific contact with the operator DNA
(Kaptein et al., 1985). The residues forming the
loops between the helices are for the most part
tolerant to substitutions.

The second region containing many sites in- 
tolerant to substitutions extends from amino acids 
239 to 241 until amino acids 289 to 292. The experi- 
mental identification of this region as a crucial 
portion of the repressor has emerged from this 
study. We have carried out a computer search in 
GenBank and other data bases for proteins with
significant homologies to E. coli lac repressor, and 
have found that the region of the repressor, aside 
from the amino-terminal DNA binding domain, 
which shares the greatest homology with other pro- 
teins is in fact in the region from residues 239 to 289 
(see below, and Figures 3-4). 

Figure 1 also reveals stretches of amino acids 
which appear to be almost completely tolerant to 
substitutions. For instance, very few substitutions 
between residues 100 to 112, 129 to 145, 151 to 160, 
206 to 217 or 305 to 318 appear to be deleterious. 
Throughout the protein, it appears as if long 
stretches of tolerant residues separate one or small 
clusters of sites which are sensitive to substitutions. 
In most of these cases the sensitive residues form 
hydrophobic clusters which are probably buried or 
partly buried in the interior of the protein. 

(b) Inducer binding 

Replacements resulting in the I^s phenotype,
inability to respond to inducer, are scored in the
upper portion of Figure 1. Since the vast majority
of sites in the repressor are not involved in inducer
binding, only those sites where an effect occurs are
shown. I^s mutations affect binding of the inducer
molecule and/or the allosteric transition that
reduces binding affinity to operator. The strongest
effects are shown as darkened boxes (see legend to
Figure 1). In addition to a large array of I^s sites
from amino acids 66 to 99, and scattered sites over
the next 90 residues, the remainder of the protein
contains distinct clusters of I^s sites at regularly
spaced intervals (see lower portion of Figure 1 and
Miller et al., 1979). Exactly how these sites are
arranged in three dimensions will have to await the
elucidation of the repressor structure.
Figure 2. Three-dimensional configuration of lac
repressor headpiece. Kaptein and co-workers have used
NMR spectroscopy to determine the structure of the
amino-terminal 51 amino acids of lac repressor (Kaptein et
al., 1985). Based on coordinates kindly provided by Dr R.
Kaptein, we have used a Silicon Graphics Personal Iris
workstation to generate this view of the repressor headpiece.
The first 2 helices (I and II) are part of the
helix-turn-helix motif seen in a number of regulatory proteins.
The degree of tolerance to substitution, as derived from
the data in Figure 1, is shown here. The residues most
sensitive to substitution are shown in the darkest shading.



(c) Homologies with related proteins 

One interesting application of the substitution
data for lac repressor is to compare it to the
observed evolutionary variation of equivalent sites
of related proteins. To this end, we generated a
multiple alignment of proteins homologous to lac
repressor. FASTA (Lipman & Pearson, 1985) was
used in the initial search against the PIR and translated
GenBank (GenPept) protein data bases.
FASTA recovered 16 proteins, all prokaryotic regulatory
proteins except for the periplasmic D-ribose
binding proteins of E. coli and Salmonella typhimurium
(see Figures 3-4). Additional alignments of
the 330 to 360 regions with eukaryotic proteins such
as with gag polymerase and filamentous proteins
were rejected after examining the pairwise alignments.
Some of the matched proteins (gal, cyt
repressor) have been aligned previously to lac
repressor (von Wilcken-Bergman & Mueller-Hill,
1982). Additional searches using BLAZE and
BLAST (Altschul & Lipman, 1990; Brutlag, D. L.,
Dautricourt, J.-P., Diaz, R., Fier, J. & Stamm, R.,
unpublished results) recovered 14 of the 16 proteins
as highest scoring. It should be noted that other
helix-turn-helix repressors such as lambda and cro
do not appear in the matches. Though most of the
matching proteins are bacterial repressors, only one
of them, the Klebsiella pneumoniae lac repressor, is
identical in function to E. coli lac repressor.

Pairwise alignments between lac repressor and the
database proteins revealed a number of conserved
regions in the sequence, two of which were the most
prominent. One, centered on the amino-terminal
region of the proteins, comprises the helix-turn-helix
DNA recognition motif. This match was found
for all the proteins in the group which bind DNA.
The E. coli periplasmic D-ribose binding protein
(Groarke et al., 1983) lacks this region, while the
Vibrio ScrR protein (Blatch & Woods, 1991)
consists only of this region. The second large region
with extensive homologies begins at approximately
residue 239 (E. coli lac repressor numbering) and
extends at least to residue 289. It comprises a new
conserved region for this family of proteins. Based
on these data, a multiple sequence alignment was
generated using either CLUSTAL V (Higgins &
Sharp, 1988) or the progressive similarity alignment
of Feng & Doolittle (1990). The two programs gave
very similar results. Various combinations of proteins
and alignment parameters were used to ensure
the robustness of the alignments. The maltose
repressor (Reidl et al., 1989), despite low homology
in the 200 to 300 region, was aligned throughout its
entire length, but the comparable region of the
ribitol repressor (Wu et al., 1985) could not be
satisfactorily aligned. Therefore, only the first 60
amino acids of this protein were used in the alignment.
No manual editing of the alignment was
performed beyond these deletions. Figures 3 and 4
present the results of the alignment. Above the
alignment we have placed a box whose shading
indicates substitutional tolerance for that site in
E. coli lac repressor, with intolerant residues black
and tolerant ones white. This one-dimensional
representation allows us to compare the conserved
sites with the substitution data for E. coli lac
repressor from Figure 1. An examination of
Figures 2-3 shows a highly significant correlation
between conserved sites in the alignment and sites
which are intolerant to substitution in E. coli lac
repressor, and conversely, between sites which are
not conserved in evolution and sites which are
tolerant to substitution. (We consider the implications
of this correlation below.) For instance, there
are 134 (41%) intolerant sites and 194 (59%)
tolerant sites. Likewise, there are 128 (39%)
conserved sites, and 200 (61%) non-conserved sites.
If there were a random distribution of conserved
and non-conserved sites, then among the intolerant
sites we would expect 52 conserved sites, whereas
we find 97, and we would expect 82 non-conserved
sites, whereas we find 37. Similarly, among the
tolerant residues, we would expect 76 conserved,
whereas we find 31, and 118 non-conserved, whereas
we find 163. These results are highly significant
(χ² = 156; p < 0.005). Also, the correlation coefficient
C (Schulz & Schirmer, 1979) is 0.58 (see
Materials and Methods), which is highly significant
(Matthews, 1975).

Figure 5. Deletions in the lacI gene. Different lengths of the lacI gene were deleted by site-directed mutagenesis (see
Materials and Methods). The amino acids missing from the resulting repressors are shown here. Thus, delta 100-112
means that amino acids 100 to 112 are missing from the repressor. Dark bars indicate that the repressor has no activity,
and an open bar (as in the case of the single amino acid deletions at positions 208 and 313) indicates that the repressor
has activity (see also Table 3). The information in Figure 1 has been summarised on the main line here to indicate the
degree of tolerance to substitutions of different regions of repressor. The darker the shading, the more intolerant to
substitution the region is.
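
The expected site counts quoted in section 3(c) for a random distribution can be re-derived from the row and column totals of the observed table. The Python sketch below is only a check on that arithmetic; the function and variable names are hypothetical, and no test statistic is computed.

```python
def expected_counts(observed):
    """Expected cell counts under independence of rows and columns,
    computed from the marginal totals of a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Rows: intolerant sites, tolerant sites.
# Columns: conserved, non-conserved (observed counts from the text).
observed = [
    [97, 37],
    [31, 163],
]

expected = expected_counts(observed)
rounded = [[round(value) for value in row] for row in expected]
```

Rounded to whole sites, the expected values are 52 and 82 for the intolerant sites and 76 and 118 for the tolerant sites, matching the figures given in the text.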

(d) Regions of repressor that are tolerant
to substitutions

The substitution profile shows that segments of
the protein are virtually insensitive to single amino
acid replacements. What function do these stretches of
up to 14 amino acids serve? To test whether these
stretches could be eliminated, we deleted the
residues comprising the substitution-tolerant
regions using oligonucleotide-directed site-specific
mutagenesis. Figure 5 shows the regions we deleted.
In each case, the resulting repressor lost function in
vivo. Namely, it could no longer repress the lac
operon. We also constructed mutants in which we
replaced the amino acid stretches with runs of
alanines, as depicted in Figure 6. We used alanine
because it was the most common functional replacement
in our data set. In several cases replacements
of up to eight amino acids with alanine runs still
resulted in repressors with apparently normal function,
since the lac genes were still IPTG inducible
(see Table 3). Because some stretches can be
replaced with a completely different amino acid
sequence, but not deleted, it may be that these
stretches serve to correctly space crucial residues in
the protein, and that a unique sequence is not
necessary for this function. Comparison to the
three-dimensional structure, when it becomes available,
will allow the determination of whether the
tolerant stretches are contained in specific structural
elements, such as surface loops or helices.

Figure 6. Alanine replacements in the lac repressor. Site-directed mutagenesis was used to replace segments of the lac
repressor with continuous runs of alanines. The numbers indicate the extent of the replacements. Dark bars indicate
inactive repressor, open bars indicate active repressors, and shaded bars indicate partially active repressors. See also
Table 3 and the legend to Figure 5.

Figures 3-4. Multiple sequence alignment of proteins with sequence homology to lac repressor. scrR, Vibrio
alginolyticus sucrose uptake repressor (Blatch & Woods, 1991); rbtR, ribitol operon repressor from Klebsiella (Wu et al.,
1985); lacI, E. coli lac repressor (Farabaugh, 1978); lacI* (klbR), Klebsiella pneumoniae lac repressor (Buvinger & Riley,
1985); cytR, E. coli cyt repressor (Valentin-Hansen et al., 1986); purR, E. coli purine biosynthesis operon repressor (Rolfes
& Zalkin, 1988); galR, E. coli galactose repressor (von Wilcken-Bergman et al., 1982); ebgR, evolved beta-galactosidase
operon repressor (Stokes & Hall, 1985); rafR, E. coli raffinose operon repressor (Aslanidis & Schmitt, 1990); fruR, fructose
phosphotransferase system repressor (Jahreis et al., 1991; unpublished GenBank submission X55457); malR, E. coli
maltose repressor (Altschul et al., 1990); rbsB, D-ribose periplasmic binding protein (Groarke et al., 1983); graR, catabolite
repression protein for alpha-amylase gene expression in B. subtilis (Henkin, T. M., Grundy, F. J., Nicholson, W. L. &
Chambliss, G. H., unpublished GenBank submission M85182); ascG, asc operon regulatory protein in E. coli (Hall & Xu,
1992); galS, isorepressor of the gal regulon in E. coli (Weickert & Adhya, 1992); galR*, H. influenzae gal repressor
(Maskell, D. J., Szabo, M. J., Deadman, M. & Moxon, E. R., unpublished GenBank submission X65934); AUD1,
amplification element of S. lividans (Piendl, W., Eichenseer, C., Viel, P., Altenbuchner, J. & Cullum, J., unpublished
GenBank submission X65465). The alignments were generated using the algorithm of Feng & Doolittle (1990). Only the
first 72 amino acids of the ribitol repressor were used in the alignment, since the program and CLUSTAL V (Higgins &
Sharp, 1988) generated excessive gaps in the remainder of the sequence during several alignment runs. Regions of
homology were boxed using a specific algorithm we developed to avoid "curve-fitting". First, for each column of amino
acids at a particular site, scores from a normalized log-odds matrix (as in Feng & Doolittle, 1990) were averaged.
Resulting scores ranged between 0.1 and 0.9, and columns with scores between 0.1 and 0.4 were boxed. Second, the number
of the most common amino acid at any particular site was derived. The second score was used to correct for columns with
log-odds scores > 0.40 that consisted of many identical amino acids (such as lysine) which have very low conservation
scores in the log-odds matrix. Columns with log-odds scores between 0.4 and 0.5 for which the number of the most
common amino acid was ≥ 5 were boxed. Columns with log-odds scores between 0.3 and 0.4 for which the number of
the most common amino acid was < 5 were unboxed. Correlation coefficients were taken from the resulting conservations
and substitutional tolerance.

(e) Substitution patterns 

Several substitution patterns are evident. (A 
complete tabular representation of the substitution 
data will be published elsewhere). It is clear that 
proline is not tolerated at many otherwise tolerant 



Table 3 

Assays of β-galactosidase in strains with altered repressors 

                                  Beta-galactosidase activity
lacI region on plasmid            No IPTG     IPTG (10^-3 M)

None                              5600        5500
Wild-type                         41          840
Multiple alanine replacements
  102-108                         5.9         660
  101-108                         22          270
  140-144                         3.3         2700
  151-158                         180         850
  206-212                         2.4         770
  234-238                         4.5         1500
  311-316                         1.6         760
  311-318                         1.4         700
  305-316                         810         1700
Deletions
  100-112                         5000        5500
  206-212                         8000        8000
  311-318                         2500        3160

Plasmids carrying each of the mutated lacI genes were 
constructed as previously described (Kleina & Miller, 1990), and 
put into a strain deleted for lac and pro, but carrying the lac 
region on an F' lacpro episome. The episomal lac region carries a 
frameshift mutation, LVH (Calos & Miller, 1981), in the lacI gene. 
Therefore, beta-galactosidase is synthesized constitutively from 
the episomal lac promoter, unless it is repressed by the I gene 
product synthesized from the plasmid. Beta-galactosidase was 
assayed after growth at 37 °C. Assay conditions and units are as 
previously described (Miller, 1972). 



positions. At many of the sites shown in Figure 1, 
only a single substitution partially or fully destroys 
protein function. In all cases, these represent the 
replacement of the wild-type residue by proline. Out 
of the 328 sites examined, 144 (44%) are extremely 
tolerant to substitutions, in that (excluding proline 
for the moment) they accept all 12 of the amino 
acids inserted by the nonsense suppressors. 
However, 51 (34%) of these otherwise tolerant sites 
do not tolerate proline. 
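
The site counts quoted above can be tied to the percentages given later in the Implications section (28% of sites tolerating all substitutions, 44% tolerating all except proline). The following is only an arithmetic sketch using the counts stated in the text; the variable names are illustrative and not taken from the paper.

```python
# Counts as stated in the text; names are illustrative only.
SITES_EXAMINED = 328
TOLERANT_EXCLUDING_PROLINE = 144   # accept every suppressor-inserted residue, proline aside
OF_THOSE_REJECTING_PROLINE = 51    # otherwise tolerant sites that do not accept proline


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Sites that accept every tested substitution, proline included.
tolerant_including_proline = TOLERANT_EXCLUDING_PROLINE - OF_THOSE_REJECTING_PROLINE

print(f"tolerant except proline: {percent(TOLERANT_EXCLUDING_PROLINE, SITES_EXAMINED):.0f}%")
print(f"tolerant of everything:  {percent(tolerant_including_proline, SITES_EXAMINED):.0f}%")
```
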

Another set of sites tolerates only hydrophobic 
amino acids, as shown by the examples depicted in 
Table 4. At a number of sites, only certain small 
amino acids (glycine, alanine, serine and cysteine, 
and sometimes threonine and valine) are tolerated, 
as summarized in Table 5. 



(f) Reliability of data 

The efficiency of suppression is rarely 100%, and 
the efficiency of suppression varies from suppressor 
to suppressor and from codon to codon (Miller & 
Albertini, 1983; Bossi, 1983). However, by using the 
lacI^Q allele, which results in a tenfold overexpression 
of the repressor (Mueller-Hill et al., 
1968), we can compensate for the lack of complete 
suppression in almost all cases. For the majority of 
suppressed proteins, the amount of repressor being 
produced varies within a threefold range, near to 
the normal amounts of wild-type repressor, since we 
are using a single copy F' carrying the lacI gene. 
Thus, we are not creating false positives by greatly 
overproducing the repressor. Also, the assays we are 
using (Kleina & Miller, 1990) to determine repressor 
function in vivo can recognize, as partially defective, 
repressors with as much as 8 to 10% activity (see 
Materials and Methods). 

Leakiness of the amber mutation is not a factor in 
the experiments reported here. All amber mutations 
allow some level of transmission even in strains 
lacking known suppressors. Studies of fusion strains 
estimate this level to vary between 0.01% and 2% 
Table 4 

Substitution patterns of selected sites in the repressor 

                                  Amino acid appearing at position
Wild-type
Residue  Ile    Leu    Phe    Val    Pro    Cys    Tyr    Ala    Gly    Ser    His    Gln    Arg    Lys    Glu

Ile64    (+)    +      +             +             +      +                    +
Val66           +             (+)           +      +, s   +             ts     +
Leu71           (+)    +                           +                                  ts
Ile123   (+)    +      +      +                           +
Ile124   (+)    +      +      +             +      +, ts  ±                    +
Phe147                 (+)                  ±      +      +                    +      -, ts
Phe161          +, ts  (+)           ts                                 -, ts
Leu174          (+)    +                    +      +      +             -, ts         ts
Ile182   (+)    +      +             +             +      +
Leu243          (+)    +                    +
Val270          +      +      (+)                         +
Leu286          (+)           +             ±             ±, ts
Ile289   (+)    +             +             +

All amino acid replacements were made by using the amber suppressors indicated in Table 1. See also 
Kleina & Miller (1990). Different designations are used for the I phenotype. In general, + refers to no 
significant detected alteration (greater than 200-fold repression of beta-galactosidase in cases where 
measured); ± indicates altered repression, but retention of the ability to repress beta-galactosidase 
synthesis 20 to 200-fold; - usually designates less than 4-fold repression. These designations are only a 
rough guide. Substitutions resulting in a temperature-sensitive phenotype are indicated by ts. 
Substitutions resulting in a loss of response to inducer are indicated as s for I^s repressors, or ws for a 
weaker I^s phenotype. 
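
The repression thresholds in this legend can be written as a small lookup. This is a sketch only: the function names are hypothetical, the legend itself calls the designations "a rough guide", and it assigns no symbol to the 4- to 20-fold range, which is reported here as "unassigned" rather than guessed.

```python
def fold_repression(unrepressed_units, repressed_units):
    """Ratio of beta-galactosidase activity without and with functional repressor."""
    return unrepressed_units / repressed_units


def classify_repression(fold):
    """Map a fold-repression value onto the rough symbols described in the legend."""
    if fold > 200:
        return "+"           # no significant detected alteration
    if 20 <= fold <= 200:
        return "±"           # altered, but still represses 20- to 200-fold
    if fold < 4:
        return "-"           # usually less than 4-fold repression
    return "unassigned"      # 4- to 20-fold: the legend gives no symbol


# Hypothetical activity values, for illustration only.
print(classify_repression(fold_repression(5000.0, 10.0)))   # 500-fold
print(classify_repression(fold_repression(5000.0, 100.0)))  # 50-fold
```
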



(Miller & Albertini, 1983). However, even with the 
overproducing I^Q allele operating, none of the 
amber mutations in the lacI gene (from codons 2 to 
329) exhibit any measurable repressor activity in 
the absence of a suppressor under any experimental 
conditions we have employed. 

(g) Implications 

The work of Perutz and co-workers on the hemoglobins 
and myoglobins (Perutz et al., 1965; Perutz 
& Lehmann, 1968) established that in globular proteins 
the non-polar residues in the interior of the 
protein are not replaceable with polar amino acids, 
but in many cases are replaceable with certain other 
non-polar amino acids. On the other hand, residues 
on the surface of the protein are usually freely 
exchangeable between non-polar and polar amino 
acids, unless they are part of the substrate or ligand 
binding site, or part of intersubunit contacts. 
Among similar proteins in different species, residues 
at interior positions are more highly conserved than 
residues on the surface (Perutz et al., 1965). Work 
with extensive amino acid replacements in other 
proteins has reinforced these conclusions. Sauer and 
co-workers (Bowie et al., 1990) have found, using 
combinatorial cassette mutagenesis, that many 
residues in the phage lambda repressor amino-terminal 
domain can be freely substituted, but 
buried residues are more refractory to amino acid 


Table 5 

Effects of amino acid replacements at selected sites in the repressor 

Amino acid appearing at position 
Wild-type Leu Phe Pro Cys Tyr Ala Gly Ser Thr His Gln Arg Lys Glu 



Residue 
Ala57 
Gly65 
Gly166 
Gly218 
Gly225 
Ala241 
Ala250 
Gly252 
Gly272 
Ser279 
Thr287 
Thr288 
Gly297 
Thr328 



- (+) 



± 

(+) - 



±, ts - 



- ±, ts - 

- + - 



+ 
+ 



-, ts -, ts - 


+ - 
+ - 




- ws - 

+ - 


-, ws - 


+ - 

- -, ws 



(+) 
(+) 

(+) ±, ts 
+ 

- (+) 
+ (+) - 

+, ws +, s (+) 



+ 

(+) 
(+) 



+ 
+ 



+ - 

+ + 

± (+) 

± + 



+ 
+ 

± 

+ 



(+) 

(+) 

(+) 



For details, see legend to Table 4. 
exchanges. Only positions on the surface could 
tolerate replacements by polar residues. A recent 
study of the systematic replacement of amino acids 
at 163 of the 164 positions in bacteriophage T4 
lysozyme (Rennell et al., 1991), using our amber 
suppressor system, has shown a very strong correlation 
between degree of substitutional tolerance and 
the degree of solvent accessibility. Residues which 
are exposed to solvent (and on the surface) are 
freely substitutable, whereas those which are not 
accessible (and buried) are intolerant to substitution. 
In light of these and our previous studies 
(Miller et al., 1979; Kleina & Miller, 1990), the 
finding that a large number of sites in the repressor 
can be freely substituted is not surprising. In the 
repressor work reported here, 28% of the 328 sites 
tolerated all substitutions, 44% all substitutions 
except proline, and 59% most substitutions. The T4 
lysozyme study revealed that 55% of the sites 
tolerated all of the substitutions. The somewhat 
higher percentage of sites tolerating every substitution 
can be partly attributed to the different sensitivities 
of the assays used (Rennell et al., 1991; see 
Materials and Methods). The form of the results in 
the two data sets is strikingly similar. For example, 
in both the repressor and lysozyme study, proline is 
frequently not tolerated at otherwise tolerant sites. 
Extensive substitutions have also been used to identify 
important functional regions in other proteins. 
For example, random mutagenesis and screening 
was used to detect missense mutations in each of the 
99 coding positions in the HIV-1 protease, which 
revealed three important functional regions of the 
protein (Loeb et al., 1989). 

One striking feature of Figure 1 is how it reveals 
the regions which are sensitive to substitution. Two 
sizeable regions, from residues 1 to 59 and from 
residues 241 to 289, are very sensitive to substitutions, 
creating the I^- phenotype. The amino-terminal 
59 amino acids form a separate domain 
which includes the DNA and operator binding sites 
(Mueller-Hill, 1975; Platt et al., 1973; Ogata & 
Gilbert, 1978), which explains its intolerance to 
many amino acid substitutions. The second, 241 to 
289 region, has an unknown function. It is precisely 
this segment which shows the strongest conservation 
among proteins related to the repressor 
(Figures 2-3). Note that this region also includes 
several clusters of I^s sites, presumably defining part 
of the inducer binding site. 

A second striking feature of Figure 1 is the pre- 
sence of segments which are almost completely 
tolerant of single amino acid substitutions. This 
includes residues 100 to 112, 129 to 145, 151 to 160, 
206 to 217 and 305 to 318. These "open" regions are 
interdispersed between single residues or clusters of 
residues that are sensitive to substitution. The 
finding that the residues in the open regions cannot 
be deleted (Figure 5) but can be replaced by poly- 
alanine stretches (Figure 6) indicates that stretches 
of the sequence serve as spacers for the hydrophobic 
residues in the core. In fact, in the vast majority of 
cases the sensitive residues separated by the spacers 



are hydrophobic amino acids, and the suggestion is 
that they form part of the hydrophobic core of the 
protein. Although the three-dimensional structure 
of the lac repressor is not yet known, X-ray crys- 
tallographic data on the structure is forthcoming 
(Pace et al., 1990). The data in Figures 1, 5 and 6 
predict that the open regions which are replaceable 
by poly-alanine stretches may represent surface 
residues, since they can accept replacements of up 
to eight amino acids at a time. It should be noted 
that a method called "clustered charged-to-alanine 
scanning mutagenesis" (Bennett et al., 1991; 
Bass et al., 1991; see also Wertman et al., 1992) 
surveys the surface of the protein by substituting 
alanines for all the charged residues within a five 
amino acid stretch. Our replacements complement 
the recent study of Matthews and co-workers (Heinz 
et al., 1992), who replaced ten consecutive alanine 
residues in bacteriophage T4 lysozyme, in a segment 
known to form an alpha-helix on the exterior of the 
protein. The lysozyme retained normal function and 
stability. Our work differs in making the replace- 
ments using functional, rather than structural, 
criteria, and may therefore demonstrate other types 
of structure tolerant to substitutions. 

It is interesting to examine the nature of substitutionally 
intolerant residues in the repressor. The 
amino-terminal 59 amino acids are particularly 
sensitive to amino acid replacements. However, if 
we look at the remainder of the repressor (actually 
from residues 60 to 329), it is evident that virtually 
all sensitive sites are either hydrophobic residues 
that can only be replaced by certain other hydro- 
phobic residues, or else small amino acids that can 
only be substituted by other small amino acids. 
Table 6 shows that only 6 of 73 hydrophilic residues 
in this region are sensitive to substitution, whereas 
42 of 97 hydrophobic residues and 28 of 93 small 
residues are sensitive to substitution. 
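
The contrast between residue classes is easier to see as fractions. The snippet below is a small arithmetic sketch using only the counts quoted from Table 6 in the paragraph above; the dictionary and function names are illustrative.

```python
# (sensitive sites, total sites) for residues 60 to 329, as quoted in the text.
SENSITIVE_COUNTS = {
    "hydrophilic": (6, 73),
    "hydrophobic": (42, 97),
    "small": (28, 93),
}


def sensitive_percent(residue_class):
    """Percentage of sites in a residue class that are sensitive to substitution."""
    sensitive, total = SENSITIVE_COUNTS[residue_class]
    return 100.0 * sensitive / total


for name in SENSITIVE_COUNTS:
    print(f"{name:12s} {sensitive_percent(name):5.1f}% sensitive")
```
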

There is a strong, although not complete, correla- 
tion (see Materials and Methods) between the 
evolutionarily conserved regions of the repressor 
and the residues that are sensitive to substitution, 
as can be seen from Figures 2-3. In other words, 
the substitution map of the protein is strongly 
correlated with the "evolution map" of the lac 
repressor family. Considering that all but one of the 
proteins being compared are not lac repressors but 
other regulatory proteins, this correlation is remark- 
able. One possible implication of this correlation is 
that the sequence divergence of these proteins 
resulted from near neutral mutations which 
occurred relatively independently, with multiple 
compensating mutations being rare. In other words, 
the pattern of allowed amino acids for a particular 
site in a protein may remain constant even when up 
to 80% of the amino acids in the protein have 
changed during evolutionary divergence. A similar 
study of suppressor-generated changes in phage T4 
lysozyme showed that whereas 74 of 163 residues 
tested (45%) are sensitive to at least one substitu- 
tion, all 14 residues that are fully conserved among 
five phage-encoded lysozymes are sensitive to 
substitutions (Poteete et al., 1992). It should be 
noted that although the variability of a site 
generally correlates with the evolutionary variability, 
this is not always the case. For instance, 
Tyr17 and Gln18 in the amino-terminal region of 
the repressor are intolerant to substitution yet vary 
considerably in the aligned proteins. These sites 
have both been implicated in controlling the specificity 
of DNA binding (Lehming et al., 1987, 1988). 
Since operator sequences vary among DNA binding 
proteins, one expects similar variation of amino 
acids conferring specificity. With respect to the 
above, it is interesting to note that the I^s sites 
(residues presumably involved in inducer binding) 
are not well conserved. One might rationalize this 
result by arguing that the specific ligands being 
bound by each regulatory protein are different, so 
that parts of the binding site involved in specificity 
should be variable in the lac repressor family, but 
intolerant to substitutions in any given protein from 
the set, for the ligand binding property. 

There are also a number of residues, such as 
Gly91, which are highly conserved in the aligned set 
but can be freely substituted in the repressor. In 
some cases these residues represent sites which 
simply do not conform to the general pattern, and 
in other cases may be due to specious homologies or 
unrecognized subtle functions of the repressor 
protein. 

We thank Drs Ponzy Lu, Mitch Lewis, David 
Eisenberg, and Brian Matthews for valuable discussions. 
We are indebted to Dr Robert Kaptein for supplying 
the coordinates of the lac repressor headpiece shown 
in Figure 2. This work was supported by grants from 
the National Institute of Health (USHHS R01 GM 
43827-02) and from the Department of Defense 
(DAAL03-92-G-0173). 
ABSTRACT: Exhaustive-substitution studies, where many amino acid replacements are individually tested 
at all positions in a natural protein, have proven to be very valuable in probing the relationship between 
sequence and function. The broad picture that has emerged from studies of this sort is one of functional 
tolerance of substitution. We have applied this approach to barnase, a 110-residue bacterial ribonuclease. 
Because the selection system used to score barnase mutants as active or inactive detects activity down to 
a level that can be approached by nonenzyme catalysts, mutants that test inactive are essentially devoid 
of enzymatic function. Of the 109 barnase positions subjected to substitution, only 15 (14%) are vulnerable 
to this extreme level of inactivation, and only 2 could not be substituted without such inactivation. A 
total of 33 substitutions (amounting to 5% of the explored substitutions) were found to render barnase 
wholly inactive. The profoundly disruptive effects of all of these inactivating substitutions appear to 
result from either (1) replacement of a side chain that is directly involved in substrate binding or catalysis, 
(2) replacement of a substantially buried side chain, (3) introduction of a proline residue, or (4) replacement 
of a glycine residue. Although substitutions of these types are functionally tolerated more often than not, 
the system used here indicates that only these sorts of substitution are capable of single-handedly reducing 
catalytic function to, or nearly to, levels that can be achieved by nonenzyme catalysts. 



It is hoped that investigations of protein folding will 
ultimately yield a comprehensive understanding of the 
relationship between protein sequence and structure. Al- 
though this is an ambitious undertaking, it is only half of 
the larger effort aimed at elucidating the relationship between 
sequence and function. The other half, aimed at understanding 
the structure-function relationship, is no less ambitious. 

While a general solution to the grand problem of relating 
protein sequence to function will clearly be some time in 
the making, we do have at our disposal, in the form of natural 
proteins, thousands of specific solutions to this problem. It 
therefore makes good sense for us to glean as much raw 
data as we possibly can from these natural solutions. Given 
the complexities and subtleties of the sequence-function 
relationship, it also makes sense for us to collect these data 
in a manner that is entirely unbiased by any a priori 
expectations we may hold regarding the nature of that 
relationship. 

The simplest and most direct way to obtain such an 
unbiased data set is to produce and test a large collection of 
mutant proteins where all possible single-residue substitutions 
are represented. This might be termed the exhaustive- 
substitution approach. Since all single-replacement pos- 
sibilities are examined by this approach, the data lack any 
imprint of the experimenter's expectations, and they are 
complete in the sense that the entire molecule is examined 
(there is no question that more information would be gained 
by examining all possible double or triple substitutions, but 
such increases in the level of substitution lead very quickly 
to impractically large numbers of mutants). 

A number of studies have captured the essence of this 
approach (1-8). One of the broad features to emerge from 
these studies (1, 4, 5, 7) and others (9, 10) is a positive 
correlation between the degree of solvent exposure at an 
amino acid position and the level of substitutional tolerance 
at that position. It has also become clear that natural proteins 
typically tolerate single substitutions, even nonconservative 
ones, at most positions without complete loss of function. 
A peculiar exception to this is the phage P22 Arc repressor, 
which was found to tolerate nonconservative substitutions 
at only 8 of its 53 positions (2). Perhaps the unusual 
behavior of this protein can be attributed in part to its 
unusually small size. 

Another factor, one that clearly affects the results of any 
exhaustive-substitution study, is the activity threshold, the 
minimum level of activity necessary for a mutant to be scored 
active. Because of the large number of mutants involved, 
these studies typically rely upon rapid in vivo screens that 
produce a binary (i.e., "active" or "inactive") indication of 
activity. It is generally possible to set the activity threshold 
at various levels within some range, the choice being more 
a matter of experimental convenience than necessity. For 
any experimental protein, a high threshold will lead to a more 
inclusive list of positions deemed to be functionally important 
than will a low threshold. 1 In previous studies, thresholds 
have been in the range of 3-30% of wild-type (WT) 1 activity 
(1, 2, 4-6). 

Barnase, a bacterial ribonuclease, provides an opportunity 
to apply the exhaustive-substitution approach using a very 
low activity threshold. The extreme autotoxicity of this 
enzyme (11) allows direct selection of mutants with very 
low (approximately 0.1% of wild-type and lower) activities 
(12), thereby enabling us to directly identify residues that 
profoundly influence enzyme function. 2 Here we report the 
results of an experiment designed to identify all single-base 
missense mutations that affect barnase function to this extent. 

MATERIALS AND METHODS 

Strains and Plasmids. The Escherichia coli strains and 
the plasmid used in this work have been described previously 
(12). Plasmid pSNBR carries a synthetic barnase gene, 
synbar, that is interrupted by two amber stop codons. Strain 
C-la, a nonsuppressing strain, is used to prepare pSNBR 
DNA. Strain MX383, a suppressing strain, reads amber stop 
codons as serine codons, causing a full-length product to be 
produced. 

Mutagenesis of synbar. The number of possible single- 
position substitutions for a protein the size of barnase is large 
enough to make individual preparation, sequencing, and 
testing of all single mutants impractical. Various approaches, 
all with strengths and weaknesses, have been employed in 
previous studies to overcome this difficulty. Three con- 
straints were dominant in our choice of a method for 
mutagenesis. First, since this is a study of the effects of 
single substitutions, we required that the chosen method 
primarily produce singly substituted variants. Second, since 
the synbar selection system selects clones producing inactive 
barnase variants, we needed a method that would minimize 
accidental introduction of frame-shift mutations (virtually all 
of which would pass selection). And third, since a particular 
sort of amber suppression (encoded by the supD allele) is 
an integral part of the synbar selection system (12), methods 
involving a large set of strains with various amber-suppression 
phenotypes (1) could not be used. 

These constraints can be adequately satisfied by a method 
that uses mutagenic oligonucleotides containing suitably 
small amounts of "contaminating" bases. By incorporating 
these oligonucleotides in a manner that avoids blunt-end 
ligations, we are able to achieve a low background frequency 
of frame-shift mutations. The primary limitation to this 
approach is the fact that it restricts the substitution set at a 
particular position to the amino acids that can be specified 
with a codon that differs only by one base from the wild- 
type codon. For synbar, this means that the number of 
accessible substitutions per position ranges from 5 to 7 
(average = 6.2). Although most substitutions are inacces- 
sible by this method, the number and variety of accessible 
substitutions ensure that multiple nonconservative substitu- 



1 Abbreviations: fMet, N-formylmethionine; FMOC, 9-fluorenyl- 
methoxycarbonyl; HPLC, high-performance liquid chromatography; 
PCR, polymerase chain reaction; RNase, ribonuclease; WT, wild-type. 

2 To avoid confusion, it should be emphasized that throughout this 
work the terms inactive, inactivating, functional, sensitive, tolerant, and 
related words refer to properties of barnase or its variants (often with 
respect to particular amino acid positions), or to the effects of amino 
acid substitutions on barnase, but not to the bacterial host or to the 
effects of barnase variants on the host cell. 



tions (in addition to more conservative ones) will be possible 
at all positions. This will enable us to obtain the desired 
information. 
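
The restriction to single-base changes can be made concrete by enumerating, for a given codon, the amino acids reachable by altering one base. The sketch below uses the standard genetic code; the function name and the choice of CGT as the example codon are assumptions for illustration (the synbar sequence itself is not given here). For an arginine codon such as CGT it yields Cys, Ser, Gly, Leu, Pro and His, the same six substitutions that the Results section lists for the Arg codons 83 and 87.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def accessible_substitutions(codon):
    """Amino acids (one-letter codes) reachable by a single-base missense change."""
    wild_type = CODON_TABLE[codon]
    reachable = set()
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            residue = CODON_TABLE[mutant]
            # Skip silent changes and stop codons; only missense changes count.
            if residue not in (wild_type, "*"):
                reachable.add(residue)
    return reachable


# Hypothetical example codon (an Arg codon); the real synbar codons may differ.
print(sorted(accessible_substitutions("CGT")))
```
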

The coding region of synbar was conceptually divided into 
8 contiguous regions covering from 12 to 14 codons each, 
starting from the second codon. 3 With each region being 
treated separately, random base substitutions were introduced 
throughout the gene by using oligonucleotides prepared with 
mixed phosphoramidites. For each region, an oligonucle- 
otide was synthesized that spanned the entire region and 
extended 10 bases beyond in both directions. The 10-base 
extensions were synthesized so as to perfectly complement 
the corresponding regions on the template plasmid, pSNBR. 
The central portion of each oligonucleotide, corresponding 
to one of the 8 synbar regions, was synthesized with small 
concentrations of contaminating phosphoramidites to intro- 
duce base substitutions at a low frequency. This was 
achieved by preparing the four standard phosphoramidite 
solutions at double concentrations and transferring 118 µL 
from each of these bottles to a fifth bottle containing 20 mL 
of pure anhydrous acetonitrile. The synthesizer was pro- 
grammed to draw only from the pure bottles for the first 
and last 10 base positions of each oligonucleotide but to draw 
equal volumes from the appropriate pure bottle and the mixed 
bottle at each of the central positions. The resulting product 
is a mixed population of oligonucleotides where each 
particular mutation occurs as frequently in isolation as it does 
in combination with other mutations (the latter being 
undesirable for this work). 
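
The doping level implied by this recipe can be estimated with a short calculation. The sketch below rests on several assumptions that are not stated in the text: the transfer volume is read as 118 µL per bottle, the four phosphoramidites couple with equal efficiency, and the mutagenized stretch is taken as 14 codons (42 bases), the upper end of the stated region size. Function names are illustrative. Under these assumptions the chance that a given mutation is the only one in its oligonucleotide comes out close to one half, in line with the statement that each mutation occurs about as often in isolation as in combination.

```python
def miscoupling_probability(transfer_ul=118.0, diluent_ul=20000.0):
    """Chance that a doped position receives one of the three wrong bases.

    Assumes equal coupling efficiency for all four phosphoramidites and equal
    volumes drawn from the pure bottle and the mixed bottle.
    """
    mixed_volume = diluent_ul + 4 * transfer_ul
    # Concentration of each base in the mixed bottle, relative to a pure bottle.
    relative_conc = transfer_ul / mixed_volume
    wrong = 3 * relative_conc
    total = 1 + 4 * relative_conc
    return wrong / total


def probability_isolated(p_wrong, n_bases):
    """Chance that none of the other doped positions is also mutated."""
    return (1 - p_wrong) ** (n_bases - 1)


p = miscoupling_probability()
print(f"per-position substitution probability: {p:.4f}")
print(f"probability a mutation is isolated (42 bases): {probability_isolated(p, 42):.2f}")
```
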

In separate PCRs, each of the 8 mutagenic oligonucleotides 
was used with a single biotinylated oligonucleotide to amplify 
a large portion of plasmid pSNBR (Figure 1). After 
purification from agarose gels, the product DNA was 
methylated by incubation with E. coli Dam methylase. The 
nonbiotinylated strands were then separated from the biotinylated 
strands by using Dynabeads M-280 Streptavidin 
(Dynal) according to the manufacturer's protocol. The 
nonbiotinylated mutant strands were retained for producing 
mutant plasmid clones. Two additional oligonucleotide 
primers, one biotinylated, were used to amplify a portion of 
pSNBR such that the nonbiotinylated strand from this product 
can anneal to any of the nonbiotinylated mutant strands to 
form a gapped-duplex plasmid molecule (Figure 1). After 
isolation of the nonbiotinylated strand as before, the gapped- 
duplex product was prepared by combining this strand with 
each mutant strand (in equal proportions), heating to 75 °C 
for several minutes, and allowing the mixtures to cool slowly 
to room temperature. 

The synbar Selection System. E. coli strain MX383 was 
transformed directly [by the previously described protocol 
(12)] with the mixtures of gapped-duplex DNA. Dam 
methylation of the mutant strand causes the cell to use this 
strand as the reference in correcting mismatches (14), thereby 
ensuring that synbar mutations are preserved. Because of 
the extreme autotoxicity of synbar expression (11, 12), only 
mutant genes producing largely inactive barnase variants 



3 Codon and residue numbering correspond to the sequence of mature 
wild-type barnase, where the N-terminal residue is Ala. The N-terminal 
fMet resulting from synbar expression is expected to be removed within 
the cell (13), yielding the desired wild-type sequence. Since fMet 
removal depends on the identity of the adjacent residue, we have left 
the Ala codon (codon 1) undisturbed. 
Figure 1: Introduction of random base substitutions into synbar. 
The upper illustrations depict the two PCRs used to produce the 
DNA strands that are combined to form the gapped-duplex product 
(lower illustration). Plasmid pSNBR serves as the template in both 
PCRs. Bold half-arrows indicate primers. Filled circles represent 
biotin groups that are covalently attached to primers. Open circles 
indicate DNA that has been methylated by Dam methylase. A base 
substitution in synbar results in local mispairing in the final product 
(as shown). 

allow growth of the host cell. Consequently, inactivating 
mutations are directly selected by plating on Luria-Bertani 
agar with ampicillin (12). 

Sequencing of synbar Mutants. Plasmid DNA was pre- 
pared from overnight cultures of clones passing selection. 
Sequences of mutant genes were determined by performing 
cycle-sequencing reactions with dye-labeled dideoxynucle- 
otides (Perkin-Elmer) and analyzing products with an 
automated sequencer (Perkin-Elmer, model ABI 373). Clones 
with multiple substitutions or frame-shift mutations were 
excluded from our analysis. 

Determination of OVA1 Activity. RNase activity has been 
reported for OVA1, a 15-residue peptide (NVMEERKIKVILPRM) 
corresponding to a portion of chicken ovalbumin (15). 
To perform a quantitative activity assay, the peptide was 
synthesized on a commercial synthesizer using FMOC 
chemistry and purified by HPLC. RNA-hydrolysis activity 
was determined by measuring the absorbance (301 nm) of a 
solution consisting of 100 mM Tris buffer (pH 7.6 at 25 
°C), 200 mM KCl, 5 mM MgCl2, and 2.8 mg/mL torula yeast 
RNA (type VI, Sigma), then adding OVA1 (to 30 µM), and 
monitoring the decrease in absorbance as a function of time. 
As a reference, a parallel reaction was performed by adding 
wild-type barnase to the same assay buffer. The relative 
activity of OVA1 was calculated from the ratio of the 
absorbance slopes of OVA1 and barnase during the initial 
steady-state phase of hydrolysis. 
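
The slope-ratio calculation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function names and the absorbance readings are hypothetical, and the sketch takes the raw ratio of slopes without any normalization for catalyst concentration, since the text does not say how (or whether) that was done.

```python
def least_squares_slope(times, values):
    """Slope of the best-fit straight line through (time, value) points."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def relative_activity(times, sample_absorbance, reference_absorbance):
    """Ratio of initial absorbance slopes (sample over reference)."""
    return (least_squares_slope(times, sample_absorbance)
            / least_squares_slope(times, reference_absorbance))


# Hypothetical readings over the initial linear phase, for illustration only.
minutes = [0.0, 1.0, 2.0, 3.0]
peptide = [1.000, 0.999, 0.998, 0.997]
enzyme = [1.000, 0.900, 0.800, 0.700]
print(relative_activity(minutes, peptide, enzyme))
```
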

RESULTS 

Analysis of Completeness. Figure 2 summarizes our 
findings, indicating all single mutants found to be inactive 
and all single mutants inferred to be active. Because a 
limited number of trials were used to identify inactivating 
substitutions, some such substitutions may, by chance, have 
escaped detection. Consequently, before the implications of 
these results are considered, it is important to consider how 
closely they represent the ideal data set that would result 
from an infinite number of trials. 

One indicator of completeness is the proportion of identi- 
fied inactivating substitutions for which only one example 
has been isolated. A very incomplete collection would be 
dominated by these unduplicated examples, whereas they 
would become rare as the collection approaches complete- 
ness. Our initial plan was therefore to continue the process 
of collecting and examining mutants until duplicates of all 
examples had been obtained. However, the highly nonuni- 
form distribution of mutants following selection made it 
impractical to achieve this. 
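
The completeness indicator described here, the share of distinct inactivating substitutions seen only once, is simple to compute from a list of sequenced isolates. The sketch below is illustrative: the function name and the isolate labels are hypothetical and do not reproduce the data of Figure 3.

```python
from collections import Counter


def singleton_fraction(isolates):
    """Fraction of distinct substitutions that were recovered exactly once."""
    counts = Counter(isolates)
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Hypothetical isolate list (one label per sequenced clone), for illustration only.
example_isolates = ["R83C", "R83C", "R83C", "R87H", "R87H", "L89R", "Y90D"]
print(f"{singleton_fraction(example_isolates):.2f} of distinct substitutions are unduplicated")
```

A value near 1 would indicate a very incomplete collection, while a value approaching 0 would indicate that most inactivating substitutions have been found more than once.
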

The problem we encountered is illustrated in Figure 3, 
which shows all mutants recovered from one of the 8 
contiguous synbar regions. Codons 83 and 87 normally 
specify arginine residues. The base mixtures used to prepare 
the mutagenic oligonucleotide mixture (see Materials and 
Methods) are expected to give unaltered Arg codons in most 
plasmid molecules, with a small fraction of plasmids carrying 
a missense mutation at either of these codons. These mutant 
codons are expected to specify Cys, Ser, Gly, Leu, Pro, and 
His with equal frequency. Consequently, if all mutants 
passing selection were equally inactive, we would recover 
these inactive mutants at roughly equal frequencies. The 
fact that frequencies of recovery are highly nonuniform 
(Figure 3) suggests that significant variation in activity exists 
even among these inactive mutants (see footnote 4). Variation of this sort
was sufficiently common in a previous study (12) that a third
classification was defined for mutants that impaired cell 
growth without completely preventing it. 

Despite varying degrees of inactivity, however, the 64 
inactive-mutant isolates represented in Figure 3 clearly 
demonstrate that positions 83 and 87, where all accessible 
substitutions lead to inactivation, are considerably more 
important for barnase function than any of the other positions 
in the region depicted. While we cannot be certain that an 
inactivating substitution at position 93, for example, would 
not be found if another 100 clones carrying mutations in 
this region were processed, we can be certain that most of
these clones would carry substitutions at positions 83 or 87,
and we can safely deem it highly unlikely that position 93
is actually as sensitive to substitution as positions 83 and 87
are. We can likewise be confident that positions showing
limited sensitivity to substitution (positions 89–91) are less
functionally critical than the highly sensitive positions. 
Moreover, because the identities of the inactivating substitu- 
tions found at these three positions are readily explicable in 
terms of the properties of the introduced side chains (at 
position 89, normally occupied by a well-buried leucine, Arg 
is the only accessible substitution that introduces a charged 
side chain; similarly, Asp is the only charged side chain of 
those accessible at position 90, normally occupied by a 
largely buried tyrosine; Pro introduces unusual backbone
constraints at position 91, normally occupied by a serine
situated in the middle strand of a five-strand β-sheet), our
approach appears to effectively identify the most severely
disruptive substitutions at partially sensitive positions.

Since our primary aim is to identify the residues in wild-type
barnase that most critically affect function, the overall
substitutional sensitivity of each position is more important
than the activity of any particular mutant. Some uncertainty
at the level of individual substitutions can be accepted
without losing the bigger picture at the level of amino acid
positions because several substitutions are possible at each
position. In light of this, a more appropriate measure of
completeness is the proportion of sensitive positions for
which only a single instance of recovering an inactive variant
occurred. By this measure, the data set is seen to be
complete; multiple examples (averaging 8.6 per position)
were isolated for all of the 15 positions found to be
vulnerable to inactivating substitution.

4 The biological interpretation of the observed nonuniformity is that
mutants having activities very close to the threshold level may or may
not kill the original transformed host cell, depending upon whether
the cell is able to make the necessary metabolic adjustments to
compensate for the harmful RNase activity. Cells that do make the
adjustments form colonies [often smaller than normal colonies (12)],
but these colonies will be underrepresented to the extent that the survival
rates of the initial transformants are reduced.

FIGURE 2: Summary of single-substitution results for barnase. The wild-type barnase amino acid sequence is shown in bold type, underlined
letters indicating residues with side chains that interact directly with substrate (16). At each position, substitutions that were found to render
the enzyme inactive are shown above the wild-type sequence. All other accessible substitutions (see Materials and Methods) are shown
below the wild-type sequence.

FIGURE 3: Number of independent isolates of inactive barnase
variants carrying substitutions from position 83 to position 96. The
residue position is indicated on the horizontal axis. Vertical stacks
indicate the number (vertical scale) of examples recovered for each
accessible substitution in this region. Stacks corresponding to
substitutions for which at least one example was recovered are
labeled to indicate the introduced amino acid (see Figure 2 for
accessible substitutions that were not recovered).

Paucity of Inactivating Substitutions. The most striking
aspect of the results presented in Figure 2 is the scarcity of
inactivating substitutions. Of the 109 positions examined,
94 (86%) appear to tolerate all accessible substitutions, and
only two (1.8%) are wholly intolerant of accessible
substitutions. As discussed above, it is quite possible that some
additional inactivating substitutions would be found if the
process of mutant collection and examination were continued
indefinitely. If so, this would decrease somewhat the fraction
of positions showing complete tolerance. Furthermore, since
our method of mutagenesis restricts the substitution set to
about 6 substitutions per position, it is highly probable that
some inactivating substitutions are inaccessible by this
method. However, the overall picture of substitutional
tolerance depicted in Figure 2 is not apt to be very different
from the true picture (see above). In particular, the number
of positions that are wholly intolerant of substitution is more
likely to be lower (because of the restricted substitution set)
than higher.

DISCUSSION

Extreme tolerance of substitution is not without precedent
in studies of this kind. In a study of bacteriophage T4
lysozyme, Rennell et al. (4) found that more than half of
the positions (55%) tolerate all of the 12 or 13 substitutions
tested. More remarkably, only one position of the 163
examined in that study (0.6%) was found to be wholly
intolerant of substitution. In another study, Wen et al. found
that 75% of the 121 positions in a bacterial membrane protein
tolerate nonconservative substitutions (6). Although
functional tolerance of substitution is clearly an important theme
to emerge from exhaustive-substitution work,
counterexamples do exist. The most striking of these, phage P22
Arc repressor, was noted above, as was the importance of 
the activity threshold in determining the outcome of an 
exhaustive-substitution study. 

A more thorough consideration of the significance of the 
activity threshold will be instructive at this point. We noted 
previously that experimental systems used in exhaustive- 
substitution studies typically allow the experimenter to 
choose a threshold level from a wide range of feasible values. 
This raises the question as to whether the distinction between 
active and inactive is actually arbitrary or whether there is a 
natural fixed reference point by which these terms might be 
defined. 

Significance of the Activity Threshold 

Distinctive Mechanism of Enzymes. Although DNA- 
binding proteins have been the subject of a number of 
important exhaustive-substitution studies (1, 2, 7), we will
narrow our focus here to enzymes. This class of molecules 
catalyzes chemical reactions by doing what all catalysts do, 
namely, lowering the free energy of the transition-state 
complex. What distinguishes enzymes from other catalysts 
is the way in which they accomplish this, and consequently 
the magnitude of their effect. Unlike simple catalysts, 
enzymes employ a spatially extensive structure to bind 
reactants, placing them in a geometrically precise and 
catalytically optimal orientation (17, 18) relative both to each
other (where multiple reactants are involved) and to the 
catalytic group or groups, which are typically integral to the 
enzyme. 

If separated from their protein scaffold, the catalytic groups
alone may catalyze the same reaction in solution by, for 
example, simple acid or base catalysis. To demonstrate the 
purpose of the geometric scaffold, however, one need only 
compare rates of catalysis by small molecules under physi- 
ological conditions to rates of enzymatic catalysis for the 
same reactions. Rates of single-substrate reactions of 
biological relevance vary by many orders of magnitude when 
measured in neutral aqueous solutions in the absence of 
enzymes; first-order rate constants range from 10^-1 to 10^-16
s^-1 for the set of reactions discussed by Radzicka and
Wolfenden (19). In contrast, effective second-order rate
constants (kcat/Km) for the corresponding enzymatic reactions
appear to fall within a relatively narrow range, a typical value
being 10^7 s^-1 M^-1 (19). Using this value and a typical
cytoplasmic concentration of 10^-6 M for an enzyme, we
estimate a typical pseudo-first-order rate constant for
enzymatic reactions in vivo to be 10 s^-1. This can be compared
directly to the range of nonenzymatic rate constants given
above (see footnote 5), indicating that the geometric role of the protein
scaffold increases reaction rates by some 2–17 orders of
magnitude, depending upon the reaction. 
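
The arithmetic behind this estimate can be checked with a short Python sketch. It is an illustration rather than anything taken from the original work: the function names are hypothetical, and the inputs are simply the typical values quoted in the paragraph above (a kcat/Km of 10^7 s^-1 M^-1, an enzyme concentration of 10^-6 M, and uncatalyzed rate constants from 10^-1 to 10^-16 s^-1).

```python
import math


def pseudo_first_order_rate(kcat_over_km, enzyme_conc):
    """Pseudo-first-order rate constant (s^-1) = (kcat/Km) * [enzyme]."""
    return kcat_over_km * enzyme_conc


def orders_of_enhancement(enzymatic_rate, uncatalyzed_rate):
    """Rate enhancement expressed in orders of magnitude (log10 of the ratio)."""
    return math.log10(enzymatic_rate / uncatalyzed_rate)


# Typical values quoted in the text.
k_enz = pseudo_first_order_rate(1e7, 1e-6)            # about 10 s^-1
smallest_gain = orders_of_enhancement(k_enz, 1e-1)    # fastest uncatalyzed reaction
largest_gain = orders_of_enhancement(k_enz, 1e-16)    # slowest uncatalyzed reaction

print(f"k_enz ~ {k_enz:.3g} s^-1")
print(f"enhancement ~ {smallest_gain:.0f} to {largest_gain:.0f} orders of magnitude")
```

With these inputs the enhancement spans roughly 2 to 17 orders of magnitude, matching the range stated in the text.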

Basal Activity as a Natural Unit of Measure. So universal 
and so crucial is this geometric aspect of enzymes that it 
might be viewed as a defining property of this class of 
molecules. It follows that a natural basal limit to enzyme 
activity would be a catalytic rate just above that which can 
be obtained without the spatial positioning employed by 
enzymes (i.e., the maximal rate of catalysis by a nonenzymatic
mechanism defines a limit that can be exceeded only
by employing spatial positioning in the manner that is
characteristic of enzymes). As discussed above, enzymes
typically exceed this basal limit by many orders of magnitude,
achieving catalytic perfection in some cases (19, 22, 23).

FIGURE 4: Natural scale of activity for a typical enzyme-catalyzed
reaction. The scale indicates catalytic activity in terms of basal
enzymatic activity units (bu), as discussed in the text. In this
example, the basal level of activity is 8 orders of magnitude below
the activity of the wild-type enzyme. Circles indicate activities of
hypothetical mutant enzymes. Open circles correspond to mutants
that fail to outperform nonenzyme catalysts and thus function as
nonenzymes. All other mutants exceed the basal enzymatic activity
level (1 bu) and thus function as enzymes. The shaded box indicates
a range of activity thresholds from 3 to 30% of wild-type activity.
Activity thresholds in the shaded region allow efficient enzymes
(filled circles) to be distinguished from less efficient enzymes
(circled dots) and nonenzymes, but they do not allow nonenzymes
to be distinguished from enzymes.

The scale of catalytic activity shown in Figure 4 uses basal 
activity as the unit of measure (1 basal enzymatic activity 
unit, bu, corresponds to the basal level of activity described 
above). The activity of the wild-type enzyme is taken to be 
8 orders of magnitude higher than the basal level (i.e., 10^8
bu) so that a typical enzyme might be represented. Activities of
mutants resulting from single amino acid substitutions will
span a wide range, from very close to the wild-type level to
the basal level and lower, depending upon the nature of the
change they introduce. The natural significance of 1 bu
activity, however, provides a meaningful division of this
range into two regions; mutants with activities above 1 bu
continue to exceed the performance of nonenzyme catalysts,
whereas mutants with lower activities do not. Thus, even if
these less active mutants continue to bind substrate molecules
specifically, they fail to bind them in such a way as to
enhance catalysis, and they consequently fail to exhibit the
characteristic property of enzymes. Conversely, mutants
having greater than 1 bu activity, however suboptimal they
may be, continue to exhibit the characteristic property of
enzymes.

5 Although Radzicka and Wolfenden (19) report intrinsic aqueous
rate constants (where water is the only catalyst), these values are
generally indicative of the level of catalysis that can be expected in
micromolar aqueous solutions of nonenzyme solutes. The rate at which
small-molecule solutes perform simple acid or base catalysis depends
primarily upon their pKa and concentration (see, for example, the
discussion of the Brønsted equation in ref 20). In neutral aqueous
solutions at room temperature (for the purpose of comparing them to
enzymes), water is expected to have a more significant catalytic effect
than any small solute present at micromolar concentrations, regardless
of its pKa. Like small molecules, macromolecules can act as nonenzyme
catalysts, often with the added advantages of substrate binding and local
solvent exclusion. For example, Hollfelder et al. have demonstrated
that the Kemp elimination is fortuitously catalyzed by serum albumins
(21). At an albumin concentration of 1 μM, however, their data indicate
that catalysis by water is comparable to catalysis by the albumin (pH
8.0 and 25 °C). At micromolar concentrations, then, even these more
sophisticated catalysts typically fail to outperform water significantly.

Essential versus Refining Structural Features. In this 
sense, structural features removed upon substitution could 
be classed as refining features if the activity following 
substitution exceeds 1 bu. Essential features would then be 
those structural features that, upon removal, reduce activity 
to less than 1 bu (see footnote 6). Note that the distinction has significance
because of the qualitative difference between enzyme
catalysis and nonenzyme catalysis, and not because of the
quantitative difference per se. Indeed, for mutants in the
vicinity of 1 bu activity, the quantitative differences in 
activities are relatively small. The qualitative distinction, 
however, remains important: where reversion of single 
mutants is concerned, the restoration of a refining feature 
can turn a poor enzyme into a good one, but it cannot turn 
a nonenzyme into an enzyme. 

In exhaustive-substitution studies, the activity threshold 
effectively divides all single mutants into two classes 
according to activity (above threshold = active; below
threshold = inactive). Although estimates of basal enzymatic 
activities are not generally made in the course of these 
studies, in most cases the relative proximity of the threshold 
to the activity of the wild-type enzyme implies that the 
threshold is several orders of magnitude above 1 bu. For 
example, in the study of T4 lysozyme by Rennell et al. (4), 
the threshold is placed at 3% of wild-type activity, while in
the study of β-lactamase by Huang et al. (5), it is placed at
about 30% of wild-type activity. This range of threshold
levels is indicated in Figure 4. The wild-type activities of
these two enzymes may be somewhat higher or lower than
the value represented in Figure 4 (10^8 bu), but they are not
apt to be many orders of magnitude lower. Consequently, 
we infer that the threshold levels in these studies are very 
much higher than the basal level of activity. These thresh- 
olds, then, would enable the researcher to distinguish refining 
features of relatively small effect from all other features (both
refining and essential), but they would not be useful for
distinguishing essential features from refining features (see
footnote 7). To do this, one would need a system with an
activity threshold in the vicinity of 10^0 bu.

6 It is often appropriate to view amino acid substitutions as
introducing features as well as (or instead of) removing them. For
simplicity of expression, however, we are using the term "feature" in
a broad sense to include both the presence and the absence of particular
aspects of structure. The absence of a β-carbon at position 52, for
example, is a feature of wild-type barnase that is "removed" upon
substitution at that position.

7 A deletion study on the A chain of ricin (24) used a sensitive
activity test capable of detecting activity down to 0.01% of the wild-type
value. However, the extreme specificity of the reaction (cleavage
of the N-glycosidic bond of a single adenosine base in the mammalian
ribosome) suggests that the rate corresponding to 1 bu would be many
orders of magnitude lower still.
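
To make the gap between conventional thresholds and the basal level concrete, the following Python sketch converts the 3% and 30% thresholds into basal units. It is an illustration under the assumption used for Figure 4, namely a wild-type activity of 10^8 bu; the function names are hypothetical and the real wild-type activities of T4 lysozyme and β-lactamase in bu are not given in the text.

```python
import math

# Illustrative wild-type activity from Figure 4 (an assumption, not a measurement).
WILD_TYPE_BU = 1e8


def threshold_in_bu(fraction_of_wild_type, wild_type_bu=WILD_TYPE_BU):
    """Convert a threshold given as a fraction of wild-type activity into bu."""
    return fraction_of_wild_type * wild_type_bu


def orders_above_basal(activity_bu):
    """Orders of magnitude separating an activity from the basal level (1 bu)."""
    return math.log10(activity_bu)


for label, fraction in [("3% threshold", 0.03), ("30% threshold", 0.30)]:
    bu = threshold_in_bu(fraction)
    print(f"{label}: {bu:.1e} bu, {orders_above_basal(bu):.1f} orders above 1 bu")
```

Under this assumption both thresholds sit more than six orders of magnitude above 1 bu, which is why such systems cannot separate essential features from refining ones.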

Essential Features Are Particularly Relevant to Protein 
Design. In the pursuit of a complete understanding of the
relationship between sequence and function, studies of 
refining features are no less relevant than studies of essential 
features. Essential features, however, may be of particular 
importance to efforts in protein design. A reasonable 
approach to the design of a novel enzyme would be to aim 
for a rudimentary enzyme as the initial target and then to 
apply iterative mutation–selection methods (25–28) to
optimize the crude initial design. Since the sorts of features 
that are important for the function of natural enzymes will 
presumably have to be incorporated into any successful 
designed enzyme, one might hope to facilitate the first stage 
of the design process by using information from exhaustive- 
substitution studies of natural proteins to guide the initial 
design. 

One might even be tempted to view an exhaustive- 
substitution study of a natural enzyme as an exercise aimed 
at determining the features that would need to be incorporated 
into a successful re-design of the same enzyme. However, 
an important limitation of exhaustive-substitution data in this 
regard is that they cannot be expected to reliably identify 
unimportant features (i.e., features that would not need to 
be included in a successful design). The reason for this is 
that the structural context in which single substitutions are 
made in an exhaustive-substitution study is that of the wild- 
type enzyme. Since the thermodynamic stability of natural 
proteins typically exceeds that which is necessary for 
function, we would expect there to be many destabilizing 
substitutions that are functionally tolerated when introduced 
in the context of the wild-type sequence. When the context 
is far less optimal, as would be the case for a crude initial 
design, the same sorts of substitutions might easily turn a 
weakly functional design into a nonfunctional design. 

Can exhaustive-substitution studies be used to identify 
features that would need to be included in a successful 
design? In answering this, we will consider refining features 
and essential features separately. Upon removal of a refining 
feature, a wild-type enzyme is simply transformed into a 
suboptimal enzyme. Though suboptimal, this enzyme may 
still be considerably more active than a rudimentary enzyme 
of the sort we might hope to create from an initial design. 
That being the case, it is quite possible that the refining 
feature in question may only serve as a refinement in the 
context of other more basic refinements. In a crude enzyme 
lacking these basic refinements, we cannot assume that this 
feature would have any significant effect on function. 
Refining features therefore do not generally provide useful 
information for the design of rudimentary enzymes. 

On the other hand, it seems inescapable that an essential 
feature of a wild-type enzyme will have a corresponding 
essential feature in any less optimal variant of that enzyme. 
That is, if some type of structural modification destroys 
enzyme function when it was initially optimal, it is difficult 
to imagine how the same type of modification would not 
have the same effect when applied to a suboptimal variant. 
An essential feature, then, points to a structural rule for the 
class of proteins that perform a particular function by means
of a particular fold. Exhaustive-substitution data would therefore
be of considerable value in obtaining design rules, provided
that a basal activity threshold is used to distinguish essential
features from refining ones.

FIGURE 5: Estimation of the activity threshold for the synbar
selection system. The open circle indicates the RNA-hydrolysis
activity of OVA1, a 15-residue peptide corresponding to a portion
of chicken ovalbumin (15). Because OVA1 has an unusually high
activity for a nonenzyme (0.002% of that of wild-type barnase),
we take it to represent an optimal or near-optimal nonenzyme
catalyst. Taking basal enzymatic activity to be somewhat higher
than the activity of OVA1, we define the basal enzymatic activity
unit as 1 bu ≡ 0.01% of the wild-type activity. On this natural
scale, the activity of OVA1 is 2.0 × 10^-1 bu, and the activity of
barnase mutant E73A (filled circle) is 2.0 × 10^1 bu (29). As
indicated by the shaded region, the activity threshold for the synbar
selection system lies between these two values.

It should be noted, though, that essential features and 
design rules are different things, in that the former is 
equivalent to a sequence constraint that applies to a particular 
protein, whereas the latter is a more general constraint that 
applies to all possible sequences sharing the fold and function 
of that protein. While identification of essential features for 
a particular protein provides valuable information on the 
design rules for the whole class of proteins, these rules cannot 
necessarily be deduced from experiments on a single protein. 
Some amount of interpretation will therefore be necessary 
for tentative classwide design rules to be inferred from 
exhaustive substitution data. 

The synbar System Approaches a Basal-Activity Selection
System. To perform an exhaustive-substitution experiment 
with a basal-activity threshold, one would need a simple 
screening or selection procedure capable of detecting activity 
down to a level that approaches nonenzymatic activity. This 
presents considerable practical difficulties that may be
insurmountable for many systems. The synbar selection
system described here is one system where the necessary
sensitivity appears to be attainable, or nearly so. 

Figure 5 depicts the relationship between the synbar 
threshold and known enzymatic and nonenzymatic rates of
RNA hydrolysis. The barnase mutant E73A is nearly 3 
orders of magnitude less active than the wild-type enzyme 
because it lacks the side chain that normally acts as the 
catalytic general base in the first step of the hydrolysis 
reaction (30). Since this mutant tests active in the synbar
system (12), the activity threshold for this system must lie
below the activity of the mutant, as indicated in Figure 5.
As a lower bound to the selection threshold, we will consider
a particular class of peptide catalysts.

Yanagawa and co-workers have demonstrated that some 
peptide fragments of barnase catalyze RNA hydrolysis (15).
Although they drew the conclusion that this activity is 
relevant to the function of the whole enzyme, their demon- 
stration that completely unrelated peptides show the same 
activity undermines that conclusion. Their further demon- 
stration that for peptides to exhibit this activity they need 
only carry a net charge of +2 or more argues convincingly 
that the activity they have observed is not enzymatic in 
nature. However, with activities approximately 5 orders of 
magnitude below that of wild-type barnase, these are 
remarkably active catalysts for nonenzymes. Their ability 
to bind RNA (15) accounts, at least in part, for their catalytic
performance. 

We have synthesized one of these peptide catalysts (see 
Materials and Methods) and determined its activity to be 
0.002% of that of wild-type barnase (mole-to-mole basis). 
Taking this to be an approximate upper-limit rate of
nonenzymatic RNA hydrolysis under physiological conditions,
we estimate basal enzymatic activity to be approximately
4 orders of magnitude below the activity of wild-type
barnase. This defines the basal enzymatic activity unit
used in the scale of Figure 5. Under the reasonable
assumption that barnase mutants must outperform nonenzyme
catalysts in order to exhibit lethality (evidence that lethality
requires essentially barnase-like structure, and hence proper
enzyme function, is discussed below), we conclude that the
selection threshold for the synbar system must lie above 0.2
bu. Situated between 2.0 × 10^-1 and 2.0 × 10^1 bu, then,
the threshold can reasonably be said to be in the vicinity of
the basal level, 1 × 10^0 bu. Consequently, substitutions that
lead to activities below the threshold (i.e., to nonlethal 
barnase variants) must reduce activity to, or nearly to, 
nonenzymatic levels. 
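
The bracketing argument above can be restated as a small Python sketch. This is an illustration, not the authors' procedure: the function names and the returned labels are hypothetical, while the numbers come from the text and Figure 5 (1 bu defined as 0.01% of wild-type activity, OVA1 at 0.002% of wild type, and E73A at 20 bu).

```python
# Definition from Figure 5: 1 bu corresponds to 0.01% of wild-type activity.
BU_AS_PERCENT_OF_WT = 0.01


def percent_wt_to_bu(percent_of_wild_type):
    """Convert an activity given as a percentage of wild type into bu."""
    return percent_of_wild_type / BU_AS_PERCENT_OF_WT


def bu_to_percent_wt(activity_bu):
    """Convert an activity in bu into a percentage of wild-type activity."""
    return activity_bu * BU_AS_PERCENT_OF_WT


LOWER_BOUND_BU = percent_wt_to_bu(0.002)   # OVA1, taken to lie below the threshold
UPPER_BOUND_BU = 20.0                      # E73A, which tests active in synbar


def expected_outcome(activity_bu):
    """Hypothetical helper: what the two bounds alone allow us to predict."""
    if activity_bu <= LOWER_BOUND_BU:
        return "below threshold"
    if activity_bu >= UPPER_BOUND_BU:
        return "above threshold"
    return "undetermined"


print(LOWER_BOUND_BU, UPPER_BOUND_BU, bu_to_percent_wt(UPPER_BOUND_BU))
print(expected_outcome(1.0))
```

The basal level of 1 bu falls inside the undetermined band, which is the sense in which the synbar threshold lies in its vicinity.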

Inactivating Mutations in Perspective 

Inspection of the collection of mutants having this dramatic 
effect suggests that they fall into three classes (Table 1). The 
first of these, class I, includes all substitutions that replace 
a side chain known to be directly involved in substrate 
binding or catalysis. The 17 substitutions falling into this 
class all involve replacement of Arg83, Arg87, or His102.
As with all substitutions, there is a possibility of the local 
change at the site of substitution leading to a more extensive 
structural disturbance. However, because of the crucial and 
direct role of these three residues in function, such propagated 
disturbances would not need to be present to account for 
the effects of substitution. Therefore, class I substitutions 
will be excluded from subsequent classes, even if they might 
otherwise meet the criteria for inclusion. 

Class II includes all substitutions (not in class I) that 
replace a side chain that is substantially buried (i.e., < 10% 
solvent-exposed) in the wild-type structure. Although the 
number of mutants falling into this class is similar to the 
number in the previous class, the number of positions 
involved is significantly greater. Consequently, positions that 
contribute to this class tend to be considerably less vulnerable 
to inactivating substitution (i.e., only a small fraction of the
accessible substitutions destroy function) than the positions
involved in class I substitutions. Position 75, where 4 of 7
possible substitutions are inactivating, provides a possible
exception to this. In the wild-type enzyme, the aspartate
side chain at this position forms a salt bridge with the
arginine side chain at position 83, one of the 3 positions
that account for all class I substitutions. The extreme
sensitivity to modification at position 83 suggests that the
sensitivity exhibited at position 75 might be due to the close
interaction between these two residues.

Table 1: Classification of Inactivating Substitutions

WT        solvent exposure of
residue   the WT residue (%)a    class I             class II      class III
Tyr24     9 (1)                  -                   D             -
Leu42     0 (0)                  -                   P             -
Ala46     2 (0)                  -                   P             P
Gly52     4                      -                   V             V
Gly53     27                     -                   -             V
Trp71     2 (2)                  -                   C, S          -
Arg72     23 (28)                -                   -             P
Ala74     0 (0)                  -                   P             P
Asp75     0 (0)                  -                   A, H, Y, V    -
Arg83                            C, G, H, L, P, S    -             -
Arg87                            C, G, H, L, P, S    -             -
Leu89     0 (0)                  -                   R             -
Tyr90     7 (7)                  -                   D             -
Ser91     1 (0)                  -                   P             P
His102                           D, N, Q, R, Y       -             -

a Solvent-accessible surface areas, calculated by the method of Lee
and Richards (31), are given as percentages of the areas of each amino
acid X in an extended Gly-X-Gly tripeptide (32). The first value for
each position applies to the entire residue; values in parentheses apply
to side chains alone (omitted for glycines). Exposure values are not
given for residues that contact substrate because inactivating
substitutions at these positions are exclusively assigned to class I.

Many of the substitutions in class II involve replacement 
of a hydrophobic side chain with a polar or charged one. 
The exceptions to this, however, are sufficiently numerous
that they suggest another cause of inactivation. Thus, class 
III includes all substitutions (not in class I) that either 
introduce a proline residue or replace a glycine residue. 
Because the proline side chain places unusual restrictions 
on backbone conformation and the glycine side chain does 
just the opposite, both types of substitution have a strong 
tendency to introduce local backbone distortion. As shown 
in Table 1, 6 substitutions can be placed into class III, several 
of them also falling into class II. Since there are 66 positions 
where Pro is not an accessible substitution (see Figure 2), 
the true size of class III is probably somewhat larger. 
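
The three class definitions can be summarized as a small decision procedure. The Python sketch below is an illustration, not the authors' method: the function and constant names are hypothetical, the active-site set is limited to the three positions named in the text as accounting for all class I substitutions, and the whole-residue exposure value is assumed to be the one compared with the 10% burial cutoff.

```python
# Positions named in the text as accounting for all class I substitutions.
CLASS_I_POSITIONS = {83, 87, 102}
# Burial cutoff from the class II definition (< 10% solvent-exposed).
BURIED_CUTOFF_PERCENT = 10.0


def classify_inactivating_substitution(position, wt_residue, new_residue,
                                       exposure_percent=None):
    """Return the set of classes ("I", "II", "III") for one inactivating substitution.

    Residues are one-letter codes. `exposure_percent` is the wild-type solvent
    exposure; it may be None for class I positions, where no value is tabulated.
    """
    if position in CLASS_I_POSITIONS:
        # Class I substitutions are excluded from the other classes.
        return {"I"}
    classes = set()
    if exposure_percent is not None and exposure_percent < BURIED_CUTOFF_PERCENT:
        classes.add("II")   # replaces a substantially buried side chain
    if new_residue == "P" or wt_residue == "G":
        classes.add("III")  # introduces proline or replaces glycine
    return classes


# Examples using values listed in Table 1.
print(classify_inactivating_substitution(83, "R", "C"))        # class I
print(classify_inactivating_substitution(24, "Y", "D", 9))     # class II
print(classify_inactivating_substitution(72, "R", "P", 23))    # class III
print(classify_inactivating_substitution(91, "S", "P", 1))     # classes II and III
```

A substitution can belong to both class II and class III, as the Ser91 example shows, which is why the function returns a set rather than a single label.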

It is noteworthy that these three classes give a complete 
description of the kinds of single substitutions that can 
destroy this enzyme. Considering the level of functional 
impairment required by synbar selection, we would expect 
the disruptive effects of these sorts of substitutions to be 
evident from previous work on other proteins. This is indeed 
the case. Earlier exhaustive-substitution studies, for example, 
have demonstrated the functional importance of particular 
active-site residues (4–6), the higher functional sensitivity
at buried positions (1, 4, 5, 7), the unusually disruptive nature
of proline (7), and the unusual sensitivity to substitution of 
particular glycine residues (3, 4). What the synbar system 
reveals is the extent to which the corresponding structural 
features (active-site groups, buried side chains, and local 
backbone conformation) are essential to enzyme function at
the most basic level. 

Four primary conclusions are evident in this regard. First, 
it is a rare single substitution that is capable of destroying 
barnase function (i.e., reducing it to a level that can be 
approached by nonenzymes). Only about 5% of the acces- 
sible substitutions in this study were found to have an effect 
so severe. Second, in all cases where a substitution does 
have this effect, the cause (in broad terms) appears to be 
either (a) direct modification of the active site, (b) noncon- 
servative replacement of a buried side chain, or (c) introduc- 
tion of non-native local backbone constraints. Third, even 
among substitutions falling into these categories, elimination 
of enzyme function is atypical. For example, of the eight 
positions where side chains interact directly with substrate 
(indicated in Figure 2), only three are vulnerable to inactivat- 
ing substitution, and only two of those appear to be wholly 
intolerant of substitution. Finally, all of the wild-type 
residues that exhibit extreme sensitivity to substitution (i.e., 
where a substantial majority of the accessible substitutions 
eliminate enzyme function) interact directly with substrate. 

Having surveyed the set of inactivating substitutions, we 
can examine an important point that was presumed to be 
true in the previous section, namely, that barnase mutants 
must function as enzymes to exhibit lethality in the synbar 
selection system. The mere fact that some single substitu- 
tions can render barnase nonlethal in this system suggests 
that proper enzymatic activity [as opposed to the hydrolytic 
activity exhibited by peptides such as OVA1 (Figure 5)] is 
required for lethality. The fact that all instances of inactiva- 
tion can be explained with reference to particular aspects of 
the structure and mechanism of wild-type barnase strengthens 
this conclusion because it indicates that these aspects are 
necessary for lethality. The conclusion becomes even more 
compelling when we consider what the inactivating substitu- 
tions tell us about the role of larger structural elements in 
forming a lethal protein. 

Of the 15 positions that are sensitive to substitution, 12 
are not involved in any direct interaction with substrate. 
Inactivation by substitution at these 12 positions must 
therefore result from propagated structural disturbances. As 
shown in Figure 6, these points of structural sensitivity are 
distributed among four elements of secondary structure (the 
third of three helices and the first three of five β-strands).
Since the enzyme can be rendered nonlethal by propagated 
structural changes arising from substitutions in these struc- 
tural elements, we can safely conclude that these elements 
must be at least partly formed for lethality to be possible. 
By preparing and testing truncated mutants lacking either 
the major α-helix (helix 1) or the final β-strand, we have
further determined that these two elements must be present 
for lethality to be possible (D. D. Axe, unpublished result). 
This implies that all five strands of the sheet must be in place 
(the cooperative nature of β-sheets makes it implausible that
a single internal strand could be unformed). Thus, of the 
eight elements of secondary structure in barnase, seven must 
be at least partly formed for a mutant to be lethal. Taken 
together with the three sensitive active-site positions, this 
means that the molecule must be largely intact for the host 
cell to be killed by it, and this confirms our earlier 
presumption that mutants scored as active are true enzymes. 

Figure 6: Location in the barnase structure of the 15 positions
found to be vulnerable to inactivating substitution. Orange indicates
the 3 positions where the side chain interacts directly with substrate
in the wild-type enzyme (16). The other 12 sensitive positions are
colored magenta. In addition, activity has been shown to be
eliminated by both N-terminal and C-terminal truncations. The
missing portions in these inactive constructs are shown in blue and
green, respectively.

Issues Calling for Further Study 

A number of interesting questions are raised by the results 
of this work. Class I is probably of more interest for what 
it does not contain than for what it does. That either of the 
active-site residues E73 or H102 [normally filling the roles
of catalytic general acid and base in the two-step reaction 
(30)] can be replaced without destroying the enzyme raises 
the interesting question of how the enzyme compensates for 
their absence. Of particular interest among class II substitu- 
tions are those where the introduced side chain is not highly 
hydrophilic and the substitution does not fall into class III.
W71C is the best example of such a substitution. 

This tryptophan side chain normally packs against a cluster 
of other hydrophobic side chains to form a small hydrophobic 
core (33). Evidently, the structural shifts that result when a 
much smaller side chain is substituted can dramatically 
impair function. Although arginine is far more hydrophilic
than cysteine, our inability to recover a W71R mutant 
suggests that the hydrophobic portion of the arginine side 
chain is a better tryptophan substitute in this case than the 
cysteine side chain is. Since W71C is inactive, one would 
have thought that W71G would also be inactive. As
discussed above, we cannot conclusively declare a mutant 
to be active on the basis of our inability to recover it. It is 
possible, then, that W71G is inactive despite the fact that it 
was not recovered (Figure 2). The best way to conclusively 
verify that a particular mutant is active is to prepare the 
appropriate mutant plasmid and apply the synbar test as a 
screen (as in ref 12).

Among the interesting class III substitutions are the ones 
that replace either of the two glycines at positions 52 and 
53. At both positions, valine is seen to be inactivating, but
a number of nonconservative substitutions apparently do not
completely eliminate activity. The fact that numerous
independent examples of the valine substitutions were
recovered (seven examples of G52V and four examples of 
G53V) suggests that most of the mutants that were not 




recovered really are active. The presence of a β-bulge (a
β-sheet distortion caused by a surplus residue in one strand)
involving residues 53 and 54 (34) raises the interesting 
possibility that this small structural element may be important 
for barnase function. The results of a detailed investigation 
of the effects of substitutions in this region will be reported 
elsewhere. 

Design Implications 

In light of the above discussion on the significance of 
essential features, we should now consider what the results
of this study imply for the design of a barnase-like enzyme.
For the reasons indicated previously, we must here focus on 
those features that appear to be essential, recognizing that 
the list of these will probably be incomplete. The first 
implication to consider is that any protein design aspiring 
to emulate the fold and mechanism of barnase will need two 
arginines to fill the roles of R83 and R87, and probably a 
histidine to fill the role of H102 as well (H102 does not
appear to be truly essential, but it is sufficiently sensitive to 
substitution to suggest that it may be essential in anything 
but a highly optimal context). As discussed above, though, 
essential features of wild-type barnase cannot generally be 
construed as design rules for barnase-like enzymes. Since 
the experiment described here looks at single mutants only, 
it does not rule out the possibility that one or more of these 
active-site residues might be replaceable in the context of 
appropriate compensating substitutions. If we view this work 
more broadly, though, as giving us a picture of what kinds 
of single-residue features are indispensable in the context 
of a natural enzyme, we conclude that a barnase-like enzyme 
with barnase-like activity will have a few indispensable 
active-site residues, their exact identities possibly varying 
for various designs. A less optimal design, striving only for 
basal enzymatic activity, would at the very least need to have 
these few residues in their proper spatial orientations. 

Conceivably, this is essentially all that is needed for an 
enzyme to have basal activity, the trick being to design a 
scaffold that holds the few key residues in their proper 
orientations. That is, it is reasonable to view direct inter- 
action with substrate as a prerequisite for a residue to be 
considered to have a direct role in function. The role of the 
remaining residues, the scaffold residues, is then to impart 
the necessary orientations, structural dynamics, and chemical 
properties to the residues on the "front line". Again, though, 
experiments with single substitutions cannot be expected to 
give a full picture of the complexity of this front line. It 
may well be that some of the barnase residues known to 
interact directly with substrate but found here not to be 
vulnerable to inactivating substitutions (e.g., E73) would be 
essential in a less optimal context. What we conclude from 
this study, then, is that none of the scaffold residues are 
irreplaceable in the context of an otherwise wild-type
sequence (even D75, the scaffold residue found to be most 
sensitive to substitution, can be replaced without eliminating 
activity). Studies involving combined substitutions (as in 
ref 12) will provide a clearer picture of the sequence 
requirements for basal barnase function. This work consti- 
tutes an essential first step, as it provides information that is 
needed for the design of multiple-substitution experiments. 

Systematic Mutation of Bacteriophage T4 Lysozyme

Amber mutations were introduced into every codon (except the initiating AUG) of the
bacteriophage T4 lysozyme gene. The amber alleles were introduced into a bacteriophage 
P22 hybrid, called P22 e416, in which the normal P22 lysozyme gene is replaced by its T4 
homologue, and which consequently depends upon T4 lysozyme for its ability to form a 
plaque. The resulting amber mutants were tested for plaque formation on amber suppressor 
strains of Salmonella typhimurium. Experiments with other hybrid phages engineered to 
produce different amounts of wild-type T4 lysozyme have shown that, to score as
deleterious, a mutation must reduce lysozyme activity to less than 3% of that produced by 
wild-type P22 e416. Plating the collection of amber mutants covering 163 of the 164 codons 
of T4 lysozyme, on 13 suppressor strains that each insert a different amino acid residue in 
response to the amber codon, tests the effects of multiple single amino acid substitutions at 
every position in the protein (except the first). Of the resulting 2015 single amino acid 
substitutions in T4 lysozyme, 328 were found to be sufficiently deleterious to inhibit plaque 
formation. More than half (55%) of the positions in the protein tolerated all substitutions
examined. Among (N-terminal) amber fragments, only those of 161 or more residues are 
active. 

The effects of many of the deleterious substitutions are interpretable in light of the known 
structure of T4 lysozyme. Residues in the molecule that are refractory to replacements 
generally have solvent-inaccessible side-chains; the catalytic Glu11 and Asp20 residues are
notable exceptions. Especially sensitive sites include residues involved in buried salt bridges 
near the catalytic site (Asp10, Arg145 and Arg148) and a few others that may have critical
structural roles (Gly30, Trp138 and Tyr161).

Keywords: amber mutations; single amino acid substitutions; critical residues 



1. Introduction 

Bacteriophage T4 lysozyme mutants have been 
useful in studies of basic questions in molecular 
biology. Combined genetic and protein sequencing 
studies of T4 lysozyme mutants were significant in 
securing our present understanding of the genetic 
code (see Streisinger et al., 1966). More recently,
combined genetic and structural studies of this pro- 
tein have yielded insights into the structural deter- 
minants of protein stability (for a review, see 
Matthews, 1987). T4 lysozyme is an especially suitable
protein for structural studies. The structure of
the wild-type enzyme has been determined crystallographically
and refined to a resolution of 1.7 Å
(1 Å = 0.1 nm; Remington et al., 1978; Weaver &
Matthews, 1987). Moreover, many mutant variants 
of T4 lysozyme have been found to form isomor- 
phous crystals of high quality; it has been possible 
to examine closely the structural effects of numer- 
ous amino acid substitutions in this protein (see 
Alber & Matthews, 1987). T4 lysozyme is a likely 
object for studies aimed at uncovering the sequence 
determinants of protein structure, a subject some- 
times called the second half of the genetic code. 

Previous studies of mutant T4 lysozymes have 
generally focused on small numbers of mutations. 
Cumulatively, many mutants have been described, 
but the relative contributions to biological function 
of many residues in the molecule are unknown. A 
systematic survey of the effects of single amino acid 
substitutions in T4 lysozyme would generate a functional
map, which could be informative when correlated
with the structure. Ideally, such a survey
would include all 19 single amino acid substitutions 
at every position, but even for a protein as small as 
T4 lysozyme (164 amino acid residues: Tsugita & 
Inouye, 1968), over 3000 variants would have to be 
studied. To reduce the scope of the technical under- 
taking, we employed the approach used by Miller 
and co-workers in studies of the Escherichia coli lac 
repressor (Miller et al., 1979; Kleina & Miller, 1990).
In this approach, amber mutations are introduced 
into the gene encoding the protein in question, and 
the resulting mutant allele is tested in suppressor 
strains that insert different amino acids in response 
to the amber codon. In this way, it is possible, with
each single mutant allele of the gene, to test a 
number of single amino acid substitutions equal 
to the number of available amber suppressor 
specificities. 

We have introduced amber mutations into all but 
the first codon of the T4 lysozyme gene. The 
resulting set of 163 mutant phages has been plated 
on a set of 13 amber suppressor strains, each of 
which inserts a different amino acid. The suppres- 
sors employed in these studies include four natur- 
ally occurring (Winston et al., 1979) and nine 
synthetic amber suppressors (Normanly et al., 1986,
1990; Kleina et al., 1990; McClain & Foss, 1988).
These platings permit a screening of the functional 
effects of multiple single amino acid substitutions at 
every position in a protein of known structure. In a 
related study, Loeb et al. (1989) tested the effects of 
one to ten different amino acid substitutions at 
every position in the HIV-1 protease. The effects of
a great number of amino acid substitutions in the 
E. coli lac repressor have been reported (Miller et al., 
1979; Kleina & Miller, 1990); however, the structure 
of this protein has not been determined crystallographically.
An N-terminal fragment of λ cI
repressor (Reidhaar-Olson & Sauer, 1988; Bowie et 
al., 1990) has been studied by combinatorial 
cassette mutagenesis. This procedure produces a 
related data base, namely, which combinations of 
amino acid substitutions are tolerated by a protein. 
We compare the trends in the data presented here 
for T4 lysozyme with findings derived from these 
similarly comprehensive collections of mutants. 
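
The scale of the survey follows from simple counting, which the sketch below reproduces using only figures quoted in this paper (164 residues, 19 alternatives per position, 163 amber alleles, 13 suppressors, 2015 substitutions scored, 328 deleterious). The variable and function names are illustrative. The reading of the gap between 163 x 13 platings and 2015 substitutions (for instance, a suppressor inserting the residue already present in the wild type) is an assumption, not something stated in the text.

```python
# Counting sketch for the amber-suppression survey (names are illustrative).
RESIDUES = 164                 # length of T4 lysozyme
ALTERNATIVES = 19              # other amino acids possible at each position
AMBER_ALLELES = RESIDUES - 1   # every codon except the initiating one
SUPPRESSORS = 13               # suppressor strains, one inserted residue each
REPORTED_SUBSTITUTIONS = 2015  # figure quoted in the summary
REPORTED_DELETERIOUS = 328     # figure quoted in the summary


def survey_sizes():
    """Return the counts implied by the numbers quoted in the text."""
    exhaustive = RESIDUES * ALTERNATIVES      # "over 3000 variants"
    platings = AMBER_ALLELES * SUPPRESSORS    # allele x suppressor pairs
    # Pairs not counted as substitutions; ASSUMED to be cases such as a
    # suppressor inserting the wild-type residue.
    not_counted = platings - REPORTED_SUBSTITUTIONS
    deleterious_fraction = REPORTED_DELETERIOUS / REPORTED_SUBSTITUTIONS
    return {
        "exhaustive": exhaustive,
        "platings": platings,
        "not_counted": not_counted,
        "deleterious_fraction": deleterious_fraction,
    }


if __name__ == "__main__":
    for key, value in survey_sizes().items():
        print(key, value)
```

On these figures roughly one substitution in six is scored as deleterious, and the amber approach samples about two thirds of the exhaustive set of single substitutions.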

2. Materials and Methods 

(a) Media, enzymes and buffers 

LB broth contained 10 g tryptone, 5 g yeast extract, 5 g
NaCl and 1 ml 1 M-NaOH/l; LB agar contained, in addition,
11 g agar/l. It was supplemented with tetracycline at
5 to 15 µg/ml, or ampicillin at 50 to 100 µg/ml where
appropriate. Lambda agar contained 10 g tryptone, 2.5 g
NaCl and 11 g agar/l; top agar was identical, except
for containing 9 g agar/l. M9 minimal medium was
42 mM-Na2HPO4, 20 mM-KH2PO4, 8.5 mM-NaCl,
18.7 mM-NH4Cl, 1 mM-MgSO4, 50 µg thiamine/ml and
4 mg glucose/ml; minimal agar contained, in addition,
15 g agar/l. Where indicated, leucine was added at
40 µg/ml.



Restriction buffers, kinase buffer, ligase buffer and
polymerase buffers were as recommended by the
suppliers. SB was 0.4 M-Na2HPO4, 2.2 M-KH2PO4; BS
(buffered saline) was made by mixing 9 vol. 0.85% (w/v)
NaCl with 1 vol. SB. 5×AB was 250 mM-Tris·HCl
(pH 8.3), 300 mM-NaCl, 50 mM-dithiothreitol. 5×RT was
250 mM-Tris·HCl (pH 8.3), 300 mM-NaCl, 50 mM-dithiothreitol,
150 mM-magnesium acetate. Each ddNTP
mixture was made with 3 µl of 2.5 mM-ddNTP, 10 µl of
5×AB and 37 µl of water. The dNTP mixture contained
each dNTP at 2 mM in AB buffer. Reverse transcriptase
mixture contained 4.5 µl reverse transcriptase (Promega;
7 units/µl) in 40 µl of RT buffer. Sequencing gels and
related buffers were as described by Sambrook et al.
(1989).
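
The working concentrations in a ddNTP mixture follow from the volumes given above (3 µl of 2.5 mM ddNTP, 10 µl of 5×AB and 37 µl of water). The sketch below is a minimal dilution calculation under that reading of the recipe; the function name and dictionary layout are illustrative only.

```python
# Dilution sketch for one ddNTP mixture (illustrative names).

def final_concentration(stock_conc, stock_volume, total_volume):
    """C1 * V1 / V_total, in the same concentration units as stock_conc."""
    return stock_conc * stock_volume / total_volume


# Volumes in microlitres, as read from the recipe in the text.
DDNTP_UL, AB_UL, WATER_UL = 3.0, 10.0, 37.0
TOTAL_UL = DDNTP_UL + AB_UL + WATER_UL  # 50 µl

# Composition of the 5x AB stock, in mM, as given in the text.
AB_5X_MM = {"Tris-HCl": 250.0, "NaCl": 300.0, "dithiothreitol": 50.0}

ddntp_mm = final_concentration(2.5, DDNTP_UL, TOTAL_UL)
ab_strength = final_concentration(5.0, AB_UL, TOTAL_UL)  # in "x" units
ab_components_mm = {
    name: final_concentration(conc, AB_UL, TOTAL_UL)
    for name, conc in AB_5X_MM.items()
}

if __name__ == "__main__":
    print(f"total volume: {TOTAL_UL} µl")
    print(f"ddNTP: {ddntp_mm} mM, AB buffer: {ab_strength}x")
    print(ab_components_mm)
```

On this reading the mixture is 1×AB, so the 5× stock is diluted to its working strength within the ddNTP mixture itself.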

T7 DNA polymerase and restriction enzyme HpaI were
purchased from Boehringer-Mannheim. T4 DNA ligase,
Klenow fragment of E. coli DNA polymerase I, T4 polynucleotide
kinase, and other restriction enzymes were
purchased from New England Biolabs. Low-temperature
gelling agarose (SeaPlaque) was purchased from FMC
Bioproducts; [γ-32P]ATP (6000 Ci/mmol) was purchased
from New England Nuclear.



(b) Plasmids 

The amber suppressor-bearing plasmids pGFIB:sup
GLY, pGFIB:sup ALA, pGFIB:sup HIS A, pGFIB:sup
GLU, pGFIB:sup LYS and pGFIB:sup PRO H
(Normanly et al., 1990) were obtained from J. Miller;
pFTOR1Δ26 (ARG) (McClain & Foss, 1988) was obtained
from W. McClain; pDR463 and pDR464 were derived
from pGFIB:Phe and pGFIB:Cys (Normanly et al.,
1986), and have been described (Rennell & Poteete, 1989).
Plasmid pZ152 (Zagursky & Berman, 1984) was used as a
cloning vector.

Plasmid pLH416 (Hardy & Poteete, 1991; see Fig. 1)
was used as the target for introduction of amber
mutations into the T4 lysozyme gene (e) by oligonucleotide-directed
mutagenesis. This plasmid contains the bla
gene, origin of replication and filamentous phage IG
sequence from pZ152, and a modified fragment of P22
DNA extending from the Esal site in gene 13 to the HpaI
site in gene 15. The P22 segment was altered by the
replacement of sequences between the HinfI site in the 5'
end of gene 19 and codon 54 of gene 15 with: (1) a
synthetic DNA segment containing a stop codon for gene
19 and an NdeI site overlapping the initiating ATG codon
of T4 gene e; and (2) a segment of T4 DNA containing the
rest of gene e and 42 bases downstream from it. The NdeI
site normally present in the pZ152 sequences was removed
by cutting with NdeI, filling in and religating, so that the
NdeI site overlapping the translational initiation codon of
gene e would be unique.

Plasmid constructions were done by using standard
methods (Sambrook et al., 1989). A segment of DNA from
pBR322 containing the tetracycline resistance determinant
was inserted into pGFIB:sup GLY and
pGFIB:sup ALA, generating pDR621 and pDR622,
respectively. This was done by digesting the plasmids
with ClaI, filling in the ends, and ligating with the smaller
fragment of pBR322 generated by digesting with EcoRI
and PvuII and filling in the ends.

A series of plasmids containing small portions of the T4 
lysozyme gene was constructed for marker rescue experi- 
ments (see Fig. 1). Plasmid pDR609, which contains 
codons 47 to 164, was made by digesting pLH449 with 
SacI and SalI, filling in the ends and religating; pLH449

Figure 1. Genetic structure of P22 e416. The top line
shows the structure of the P22 chromosome in the vicinity 
of its lysozyme gene (19) with the location of the Kn321 
insertion indicated. The corresponding part of plasmid 
pLH416 is drawn below, with part of its DNA sequence 
near the start of gene e. The 4 underlined codons are (left 
to right): the translational initiation codon of gene 19, the 
stop codon of gene 13, a synthetic stop codon for gene 19, 
and the translational initiation codon of gene e. The 
broken lines indicate crossover points that lead to the 
generation of P22 e416 by homologous recombination 
between the defective prophage P22 Kn321 sieA44 m44 
and pLH416. The hybrid phage lacks P22 gene 19, which 
is replaced by T4 gene e. It also lacks gene 15 function; 
however, this gene contributes relatively little to the 
plaque-forming ability of P22 under the conditions 
employed in this study (Casjens et al., 1989). The sequence 
of the e416 substitution near the 5' end of gene e is 
indicated. Segments of gene e borne by plasmids used for
marker rescue experiments are indicated below. bp, base-pairs.



was generated by primer-directed mutagenesis of
pLH416, changing bases 135 to 138 of gene e from ATTA
to GCTC, creating a SacI site without changing the amino
acid sequence of the lysozyme. Plasmid pDR610 was
generated by digesting pLH449 with SacI and BamHI,
filling in and religating; it contains codons 1 to 44.
Plasmid pDR616 was generated by digesting pLH416
with HpaI and SalI, filling in and religating; it contains
codons 132 to 164. Plasmid pDR617 was generated by
digesting pDR609 with EcoRI and BamHI, filling in and
religating; it contains codons 47 to 76. Plasmid pTP408
was generated by digesting pLH416 with HpaI and
BamHI, filling in and religating; it contains codons 1 to
131. Plasmid pDR618 was generated by digesting pTP408
with EcoRI and SalI, filling in and religating; it contains
codons 79 to 131. Plasmid pTP406 was generated by
digesting pLH416 with EcoRI and SalI, filling in and
religating; it contains codons 79 to 164. Plasmid pTP407
was generated by digesting pLH416 with EcoRI and
BamHI, filling in and religating; it contains codons 1 to
76. Plasmid pDR643 was constructed by ligating a 126
base-pair HinR-HpaI fragment containing codons 91 to
131 from pLH416 into the PvuII site of pBR322.

(c) Bacteria 

E. coli strain W3110 lacIq L8 (Brent & Ptashne, 1981)
was used for propagation of plasmids and for growth of
phage λ stocks. Strain TP302 is W3110 (sup0) lysogenized
with P22 Kn321 sieA44 m44 (Rennell & Poteete, 1989).
Strain GM1675 (dam-4 Δ(lac-pro) thi-1 supE relA1/F' lacIq
ΔM15 pro+), used for propagating phage R408 (Russel et
al., 1986) and generating single-stranded plasmids, was
obtained from M. Marinus.

Salmonella typhimurium LT2 strains MS1362, MS1363,
MS1364 and MS1365 (all leuAam414, bearing the amber
suppressor alleles supD, supE, supF and supJ, respectively),
DB7000 (leuAam414), MS1868 (leuAam414 r−m+)
and MS2310 (MS1868 bearing the plasmid pKM101 ampR;
Youderian et al., 1982) were obtained from M. Susskind.
Strain TP246 is MS1868 bearing 2 plasmids: pTP298, in
which the R and Rz genes of phage λ are expressed under
control of PlacUV5 (D. Herrick & A. R. Poteete, unpublished
results); and the lacI-expressing plasmid pMS421
(Grana et al., 1988). Strains TP278, TP279, TP280 and
TP282 are MS2310 bearing supE, supF, supJ and supD,
respectively. Strains TP308 and TP309 are MS2310
bearing plasmids pDR463 and pDR464, respectively.
TP308 and TP309 were maintained by continuous passage
in liquid culture for 1 year in the presence of tetracycline.
During this time, variants with improved ability
to tolerate the presence of the suppressor plasmids came
to predominate in the cultures. Derivatives of these more
plasmid-tolerant variants that subsequently lost the plasmids
were isolated by growth in the absence of antibiotics
and screening for tetracycline- and ampicillin-sensitive
clones. One isolate of each, designated TP369 and TP368,
respectively, was kept for further use. Strains TP371,
TP372, TP374, TP375, TP377 and TP378 are TP368
bearing plasmids pFTOR1Δ26 (ARG), pDR621,
pDR622, pGFIB:sup HIS A, pGFIB:sup GLU and
pGFIB:sup LYS, respectively. Strain TP376 is TP369
bearing pGFIB:sup PRO H.

The Salmonella strains bearing suppressor plasmids
tended to lose their ability to suppress amber mutations
unless handled carefully. All such strains were stored with
cryoprotectant at −80 °C. Single colonies were obtained
by streaking from the frozen cultures on LB agar plates
and growing at 37 °C; individual colonies were then
streaked on LB agar plates supplemented with antibiotics
and incubated at 37 °C. The resulting colonies were then
streaked on minimal agar plates containing antibiotics
and incubated at 37 °C; single colonies thus obtained were
then used to inoculate liquid cultures. If no colonies grew
on the minimal plates within 36 h, a colony from the
corresponding LB agar plate with antibiotic was streaked
on minimal agar supplemented with 20 µg leucine/ml.
Colonies from the minimal plates (without, or, if necessary,
with leucine) were used to inoculate liquid cultures.
Strains TP278, TP279, TP280, TP282, TP375, TP376 and
TP377 were grown in LB supplemented with ampicillin.
Strains TP308 and TP309 were grown in LB supplemented
with tetracycline; TP374 was grown in LB.
Strains TP371 and TP378 were grown in minimal medium
supplemented with ampicillin. Strain TP372 was grown in
minimal medium supplemented with tetracycline. Liquid
cultures were grown at 37 °C to densities of approximately
2 × 10^8/ml, tested for their ability to plate a battery of
tester phages bearing amber mutations, and stored at
4 °C. Active cultures could generally be used for as long as
14 days afterwards. It was usually, but not always,
possible to obtain a new active culture by 50-fold dilution
of the old one in fresh medium and growth at 37 °C to
2 × 10^8/ml.

(d) Determination of suppression patterns 

Phage stocks were grown by inoculating 30 ml of LB 
supplemented with 10 mM-sodium citrate with 1 ml of a 
culture of suppressor bacteria grown as described above, 
but to a density of approximately 5 × 10^8/ml, and a single
plaque. (The use of citrate slightly improves the growth of
the hybrid P22 e416 derivatives, which lack gene 15
function (Casjens et al., 1989).) Following aeration at 30 °C
for 24 h, the culture was shaken with chloroform for an 
additional 30 min. Debris was removed by centrifugation 
at 7000 revs/min in a Sorvall SS34 rotor for 5 min, and 
phage were pelleted from the supernatant by centrifuga- 
tion at 15,000 revs/min for 90 min in the same rotor. 
Phage pellets were resuspended with 2 ml of BS. 

Portions (0.3 ml) of cultures of suppressor strains,
grown as described above, as well as MS2310 and TP246,
were mixed with 2.5 ml of molten top agar and poured on
λ agar plates. The top agar was allowed to harden for 5 to
10 min at room temperature, and 7-µl portions of phage
suspensions at 10^3, 10^4 and 10^5 plaque-forming units/ml
were spotted on the surface. Three sets of plates were
done for each phage: duplicate plates for incubation at
37 °C and single plates for incubation at 25 °C. After the
spots dried, plates were incubated in an inverted position
for 16 h, then scored by 2 individuals independently;
scores were compared, and any discrepancies were
resolved by re-examination of the plates. The plaque-forming
ability of each mutant was assessed relative to
that of the control wild-type hybrid phage on the same
host on the same day. It was designated + + if the
plaques were about the same size as the control; + if they
were significantly smaller; ± if they were of such a small
size that it was difficult to discern individual plaques; and
− if no plaques were produced at all. A 5th plating
phenotype was occasionally seen: plaques of sufficient size
to be scored as +, but having a hazy morphology that
made them less visible than others of the same size; such
cases were scored ±. The absolute sizes of plaques in the
+ + and + categories varied from one day to another.
However, the size of a mutant phage plaque as a fraction
of that of the wild-type control was relatively stable. If
plating conditions were changed, for instance by varying
the density of the plating culture or the moisture of the
plates, or the level of suppressor activity of the bacterial
strain, it was possible in some cases to observe changes of
one grade up or down in the score of a mutant phage.
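
The three titers give spots that range from a handful of phage to several hundred. The sketch below computes the expected number of plaque-forming units delivered per spot, taking the spot volume as 7 µl and the titers as 10^3, 10^4 and 10^5 per ml; the function name is illustrative. It assumes every plaque-forming unit can form a plaque on the host in question, which is only an upper bound for a partially suppressed mutant.

```python
# Expected plaque-forming units (pfu) per spot (illustrative sketch).

def pfu_per_spot(titer_per_ml, spot_volume_ul):
    """Expected pfu delivered in one spot of the given volume."""
    return titer_per_ml * spot_volume_ul / 1000.0  # 1000 µl per ml


SPOT_VOLUME_UL = 7
TITERS_PER_ML = (10**3, 10**4, 10**5)

expected = {titer: pfu_per_spot(titer, SPOT_VOLUME_UL) for titer in TITERS_PER_ML}

if __name__ == "__main__":
    for titer, pfu in expected.items():
        print(f"{titer:>7} pfu/ml -> about {pfu:g} pfu per spot")
```

At the lowest titer a spot carries only about 7 phage, so individual plaques can be sized; at the highest it carries about 700, which helps distinguish very small plaques from none at all.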

(e) Construction of amber mutant phages 

Amber mutations were introduced into each codon in 
the T4 lysozyme gene by the use of mismatched oligonucleotide
primers. Primers were, in most cases, designed
to carry the amber codon TAG with the outermost 
mismatch on either side flanked by 8 properly matched 
bases; 19-mers were thus used most frequently. 
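
The primer rule above (TAG in place of the target codon, with 8 matched bases outside the outermost mismatch on each side) can be sketched as follows. This is an illustration of the stated rule, not the authors' procedure: the function name is hypothetical, the toy sequence is invented and is not the T4 lysozyme gene, and the oligonucleotide is returned in the sense of the coding strand, whereas the strand actually synthesized depends on which strand of the template is single-stranded.

```python
# Sketch of the amber-primer design rule (hypothetical helper, toy sequence).
AMBER = "TAG"
FLANK = 8  # matched bases required outside the outermost mismatch


def amber_primer(gene, codon_number, flank=FLANK):
    """Return the mutant-sense oligo that converts one codon to TAG."""
    start = 3 * (codon_number - 1)
    codon = gene[start:start + 3]
    if len(codon) != 3:
        raise ValueError("codon number lies outside the sequence")
    # Positions within the codon that differ from TAG.
    mismatches = [i for i in range(3) if codon[i] != AMBER[i]]
    if not mismatches:
        raise ValueError("codon is already TAG")
    first = start + mismatches[0]
    last = start + mismatches[-1]
    if first - flank < 0 or last + flank + 1 > len(gene):
        raise ValueError("not enough flanking sequence in this toy input")
    mutant = gene[:start] + AMBER + gene[start + 3:]
    return mutant[first - flank:last + flank + 1]


# Invented 13-codon sequence, for illustration only.
TOY_GENE = "ATGGCTAAACTGGAAGTTTACACCGATCCGTTTGGTAAA"

if __name__ == "__main__":
    for n in (6, 7):
        primer = amber_primer(TOY_GENE, n)
        print(n, primer, len(primer))
```

When the first and third bases of the codon both differ from TAG, the rule gives 8 + 3 + 8 = 19 bases, consistent with 19-mers being the most common length; a codon differing at a single base (TAC in the toy sequence) gives a 17-mer.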

The targets for mutagenesis in vitro were, in most cases,
gapped duplex DNA (Kramer & Fritz, 1987). These
templates were made by mixing single-stranded, non-dam-methylated
pLH416 with purified, large, restriction
enzyme-generated double-stranded fragments of the
plasmid, denaturing and reannealing. Gaps were designed
to leave only the portion of the lysozyme gene targeted for
mutagenesis in the single-stranded state. Thus, for
example, in the cases of codons 81 to 117, most of the
mutagenic oligonucleotides were annealed to a gapped
duplex preparation in which the single-stranded part
corresponded to the sequences between the EcoRI and
HpaI sites in the lysozyme gene (see Fig. 1). In 46 cases,
however, the mutagenesis target was simply single-stranded,
non-dam-methylated pLH416. In these cases,
mutagenesis was carried out with the use of T7 DNA
polymerase, as described by Bebenek & Kunkel (1989).

Clones bearing plasmids with amber mutations in the 
lysozyme gene were identified, and the mutant alleles 
were crossed into P22 Kn321 sieA44 m44, as described 
(Rennell & Poteete, 1989). In some cases, for instance 
those of codons 162 to 164, in which amber mutations do 
not inactivate lysozyme function even in a non- 
suppressing host, it was found to be advantageous to 
employ high-efficiency mutagenesis procedures and identify
amber mutation-bearing clones directly by
sequencing. 

(f) Sequencing 

The sequence of phage DNA in the vicinity of each 
amber mutation was determined by extension of labeled 
primers with reverse transcriptase in the presence of 
dideoxynucleotides (Inoue & Cech, 1985). DNA was purified
from phage stocks as described (Rennell & Poteete,
1989); it was digested with HindIII and BglII or with
HindIII alone, heat-denatured and used without further
purification in the sequencing reactions. Primers used for
sequencing were:

(A) 5' CAGCAGTAACGGAATC 3',

(B) 5' CAAAAAGTCCATCACT 3',

(C) 5' TCTGAGAAATGCTAAA 3', and

(D) 5' CAAAAACGCTGGGATG 3';

they were labeled by polynucleotide kinase-catalyzed
phosphorylation with [γ-32P]ATP, and used following
extractions with phenol and ether.

Annealing mixtures contained 4 µl of template DNA
(about 0.5 µg), 3 µl of primer-labeling reaction mixture
(1 pmol primer), 2 µl of 5×AB, and 1 µl of water. They
were incubated in boiling water for 3 min, quickly spun
down and placed in crushed dry ice until frozen solid, then
thawed at 0 °C.

Sequencing reaction mixtures contained 2 µl of
annealing mixture, 1 µl of dNTP mixture, 1 µl of ddNTP
mixture, and 1 µl of reverse transcriptase mixture. They
were incubated at 50 °C; after 15 min, 1 µl of reverse
transcriptase mixture was added and incubation was continued
for an additional 15 min. The reactions were
stopped by addition of 6 µl of gel loading buffer and
incubation in a 90 °C water bath for 3 min. Samples of 3 µl
were subjected to electrophoresis in an 8% (w/v) acrylamide/0.4%
(w/v) bis-acrylamide sequencing gel.

In all cases, the presence of amber codons at the 
expected positions, as well as the absence of other 
mutations in the vicinity, was verified by sequencing. The 
segments of the lysozyme gene sequenced varied among
the different mutants, but generally included the entire 
segment that was exposed as a single strand in the gapped 
circular duplex template. It always included the entire 
segment into which the conditionally lethal mutation was 
mapped in the phage by marker rescue tests (described 
below). 

In 13 cases, the mutagenesis procedures resulted in the
introduction of mutations in addition to or instead of the 
intended amber mutation. Most (9) of these appeared to 
be primer-directed, resulting from partial homology of the 
mutagenic primer to secondary sites in the lysozyme gene. 
In most cases, repetition of the mutagenesis procedure 
(often, in these cases, employing a lower concentration of 
primer) sufficed to generate the amber mutant without 
secondary mutations. In 2 cases, those of am57 and 
am164, the primer was redesigned as a 31-mer with the
amber codon in the middle; the longer primers introduced 
the desired amber mutations without secondary 
mutations. 

(g) Marker rescue tests 

To provide assurance that the sequenced amber 
mutations at the intended codons were solely responsible 
for the plating phenotypes of the phages bearing them, 
marker rescue tests were carried out. These were done by 
spotting 10 µl portions of phage suspensions at titers of
10^6/ml on lawns of MS2310 (sup0), with and without
plasmids, described above, containing parts of the lyso- 
zyme gene (see Fig. 1). The ability of the phage to form 
plaques at an elevated frequency on a plasmid-bearing 
host was interpreted as signifying that the phage 
contained no deleterious mutations outside the chromo- 
somal segment represented in the plasmid. All of the 
amber mutants behaved as expected in these tests;
marker rescue was detected, in general, if the amber
codon was 2 codons or more away from either end of the
chromosomal segment represented in the plasmid (not
shown). The amber mutations located near the ends of
small segments were mapped by marker rescue from plasmids
bearing larger, overlapping segments.

3. Results and Discussion 

A collection of 163 phages bearing amber 
mutations in the T4 lysozyme gene was plated on 
amber suppressor strains and scored for plaque- 
forming ability at 37 °C and 25 °C. The results are 
shown in Table 1; plating phenotypes are illustrated
in Figure 2. Plaque-forming ability was judged as
described in Materials and Methods. In what
follows, scores of ++ or + will be considered
positive, and ± or - will be considered negative;
the descriptions of phenotypes refer to plating at 
37 °C, unless otherwise specified. 

(a) Interpretation of amber mutant 
suppression patterns 

In principle, the experiment summarized in Table 
1 tests the qualitative functionality, relative to 
wild-type, of 2015 mutant lysozymes bearing single 




Figure 2. Plating phenotypes of derivatives of P22 e416 bearing amber mutations in gene e. In the plates shown, 
phages bearing amber mutations in codons 2, 59, 161 and 11 (in order from top to bottom) were plated on strains TP280 
(Leu-inserting; left side) and TP246 (fully permissive, expresses R and Rz genes of phage λ; right side) at 37 °C. Phage
suspensions had titers of 10³, 10⁴ and 10⁵/ml (in order from left to right). The results illustrate the scores ++, +, ± and
-, respectively.
Table 1 (continued)
Plaque-forming ability was determined as described in Materials and Methods. Top rows refer to
plating at 37 °C, bottom rows to plating at 25 °C. WT, wild-type.
amino acid substitutions. However, interpretation
of amber mutant suppression patterns is not completely
straightforward, for a number of reasons.
First, the efficiency of amber suppressors is not
100%, and varies from one suppressor to another.
Thus, the phenotype of an amber mutant growing
on an amber suppressor strain results from the
combined effects of alteration of the lysozyme and
under-expression relative to the wild-type. On the
other hand, the suppressors used in these studies
have been characterized as relatively efficient
(Winston et al., 1979; McClain & Foss, 1988; Kleina
et al., 1990). Moreover, studies with hybrid phages
engineered to produce varying amounts of wild-type
lysozyme indicate that, for a mutation to be scored
as defective, it must reduce total lysozyme activity
to less than 3% of the amount produced by the
wild-type hybrid phage (Knight et al., 1987; our
unpublished results). Finally, as indicated in Table
1, every suppressor works with mutants bearing 
amber codons at a large number of positions, 
including all cases in which the resultant poly- 
peptide is supposed to have the wild-type amino 
acid sequence. (There are 104 cases of non- 
substitution in the Table.) 

Second, the efficiency of amber suppressors is 
subject to variation from one amber codon to 
another. Such "context effects", though, are rela- 
tively small (generally less than 3-fold: Miller & 
Albertini, 1983; Bossi, 1983). Indeed, every amber 
mutant forms large plaques at high efficiency on at 
least one amber suppressor. Still, general suppressor 
efficiency and context effects may play a significant 
role in determining some of the phenotypes shown. 
Such a role is suggested by the observation that in 
11 cases (codons 43, 46, 84, 96, 97, 98, 121, 124, 125, 
133 and 147) an amber mutant phage makes a 
smaller than wild-type plaque on an amber
suppressor strain that inserts the wild-type amino 
acid. 

A third factor complicating interpretation of the 
amber mutant suppression patterns shown in Table 
1 is the question of suppressor specificity. Although 
all of the suppressors used here have been well- 
characterized, two types of uncertainty remain. 
(1) Only a small amount of residual lysozyme 
activity is needed for plaque formation; therefore, 
an amber suppressor that inserted a particular 
"incorrect" amino acid only a small fraction of the 
time could give false positives. Indeed, the Glu-inserting
suppressor has been found to insert Gln
17% of the time (Normanly et al., 1990). (It should
be noted, however, that the data in Table 1 indicate
a number of mutants that form large plaques on the
Gln-inserting suppressor strain, and no plaques at
all on the Glu-inserting suppressor strain.) (2) The 
synthetic suppressors were characterized in E. coli, 
but tested with T4 lysozyme amber mutants in 
S. typhimurium. While we suppose that their specifi- 
cities would be the same in these two closely related 
species, this has not been determined directly. 

On the other hand, a number of observations 
suggest that the suppressors behave largely as
expected. The mutants that plate on the fewest
suppressor strains tend to plate only on those that
make conservative substitutions (as well as non-substitutions,
of course). The exemplar of this tendency
is am11. In T4 lysozyme, Glu11 is thought to
be the key catalytic residue (Anderson et al., 1981);
as one might therefore expect, am11 plates only on
the Glu-inserting suppressor. In another example,
Table 1 indicates that Gly30 can be replaced only
by Ala. Similarly, Tyr161 can be replaced only by
Phe, and Trp138 can be replaced only with Tyr, Phe
or Leu. With 13 different suppressors, and all 20
amino acids represented in the lysozyme poly- 
peptide, 247 different kinds of substitution can be 
made. For any given amino acid, it is possible, by 
analysis of the data in Table 1, to determine how 
well its positions in the protein can be substituted 
by any of the 13 nominally represented in the 
collection of suppressors. Upon carrying out such an 
analysis (not shown), we found, for example, that 
alanine residues can be replaced most effectively by 
Gly and Ser. Similarly, serine residues can be 
replaced most effectively by Ala and Gly, and 
glycine residues by Ala and Ser. These three amino 
acid residues thus constitute a self-contained 
"exchange group", clearly related chemically by 
virtue of being the three smallest. Likewise, Phe and 
Tyr were found to constitute a self-contained 
exchange group. Little other useful information was 
obtained from such an analysis of the entire 
suppression pattern; it was far more informative 
when these relationships were examined within two 
groups of lysozyme residues, buried and solvent- 
exposed, identified by examination of the structure 
(see below). 

A fourth complication characteristic of amber 
mutations is "leakiness". Bossi (1983) found that 
amber codons in a lacI-Z fusion gene gave rise to as
much as 2% of the wild-type level of activity in a
non-suppressing Salmonella background. Amber
codons at eight positions in T4 lysozyme (8, 32, 41, 
48, 73, 162, 163 and 164) fail to prevent plaque 
formation on the non-suppressor strain; phages 
bearing the last three of these plate particularly 
well. Two likely explanations for this phenomenon 
are: (1) the codons in question are translated 
through at relatively high frequency and represent 
positions that are tolerant of substitutions; (2) the 
amber mutants produce active lysozyme fragments. 
We believe that the first explanation pertains in the 
first five cases, the second in the last three. The fact 
that tight negative phenotypes are observed for 
amber mutations of most codons through 161 
suggests that shorter N-terminal fragments of T4 
lysozyme are not, in general, active. We tested the 
possibility that particular amber fragments might 
be active by constructing variants of pLH416 
bearing amber codons at positions 71, 72 and 73, in 
combination with a deletion that eliminates 
sequences coding for residues 79 through 164. These 
plasmids were all able to recombine with the defective
prophage P22 Kn321 to yield phages that could
form plaques on a host (TP246) that supplies lysis
functions to the infecting phage; none of these
recombinants was able to form plaques on a normal
host (not shown). In addition, we have found that the
double amber mutant P22 e416 am73 am161 does
not form plaques on a non-suppressor host (not
shown); if the fragment produced by the amber
codon at position 73 were active, the presence of a
second amber mutation downstream would presumably
make little difference. Phipps et al. (1987) have
described an active fusion protein consisting of 
residues 1 to 78 of T4 lysozyme followed by 19 other 
amino acid residues derived from a cloning vector. 
The activity of such a fusion protein would appear 
to be difficult to reconcile with our finding that a 
number of residues past 78 are critical for lysozyme 
function. Possibly, the amino terminus of T4 lyso- 
zyme contains all the necessary residues for cataly- 
sis, but requires additional structure at the carboxy 
terminus for stability, and the 19 amino acid 
residues in the fusion protein fortuitously provide 
such stabilizing structure. 

(b) Overall tolerance to substitutions 

The data in Table 1 show the overall functional 
effects of 2015 substitutions at 163 out of the 164 
positions in the T4 lysozyme polypeptide. Only 328 
of these substitutions, affecting 74 of the residues, 
were scored as deleterious. From these results it 
would appear that most (89 out of 163, or 55%) of 
the positions in the molecule are insensitive to 
substitution, being able to tolerate a minimum of 13 
different amino acid residues. 
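
The counts quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, assuming (as stated in the text) 163 tested positions, 13 suppressors, 104 non-substituting combinations and 74 sensitive positions; the variable names are illustrative and not taken from the paper.

```python
# Consistency check of the counts quoted in the text (names are illustrative).
positions_tested = 163      # codons carrying amber mutations
suppressors = 13            # amino acids nominally inserted
non_substitutions = 104     # suppressor inserts the wild-type residue

# Every position is plated on every suppressor; non-substitutions are excluded.
substitutions = positions_tested * suppressors - non_substitutions

# Kinds of substitution: 20 wild-type residues x 13 inserted residues,
# minus the 13 pairings in which the inserted residue equals the original.
kinds_of_substitution = 20 * suppressors - suppressors

sensitive_positions = 74
tolerant_positions = positions_tested - sensitive_positions
tolerant_percent = round(100 * tolerant_positions / positions_tested)

print(substitutions)          # 2015
print(kinds_of_substitution)  # 247
print(tolerant_positions)     # 89
print(tolerant_percent)       # 55
```

Under these assumptions the figures of 2015 substitutions, 247 kinds of substitution, and 89 tolerant positions (55%) are mutually consistent.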

Tolerance to amino acid substitutions is a salient 
feature of proteins. That there are many sequences 
that can form the highly conserved globin structure 
became clear from the work of Perutz et al. (1965). 
Subsequent studies of over 300 mutant human 
hemoglobins, some defective, but most fully func- 
tional, lead to the same general conclusion (for a 
review, see Weatherall & Clegg, 1976). Miller and co-workers
(Miller et al., 1979; Kleina & Miller, 1990)
pioneered the approach, used in this study, of using 
nonsense mutations and suppressors to test the 
effects of amino acid substitutions. This approach 
permits the study of mutant variants with no 
requirement that they be able to arise or survive in 
natural populations. By testing nonsense mutations 
in 141 of the 360 positions of the E. coli lac
repressor, these investigators were able to examine 
the effects of 1634 single amino acid substitutions. 
They found that approximately 70% of the substi- 
tutions in the DNA-binding amino-terminal 
domain, and 30% of the others (approx. 45% 
overall) led to loss of function. Loeb et al. (1989)
found that most positions in HIV-1 protease could
tolerate amino acid substitutions. Sauer and co-workers
(Reidhaar-Olson & Sauer, 1988; Bowie et
al., 1990) have employed a more radical form of
mutational analysis, called combinatorial cassette 
mutagenesis, to study the question of tolerance to 
amino acid substitution. In this method, a segment 
of DNA sequence encoding a part of a protein is
effectively randomized in vitro, and variants
encoding functional proteins are subsequently
selected and sequenced. Using such methods, these
investigators have found that most residues of the
phage λ repressor amino-terminal domain tolerate
many substitutions. 

Among the studies cited above, the results of 
Kleina & Miller (1990) are most precisely compar- 
able to those reported here. Out of 132 positions 
represented in the collection of lacI amber
mutations, 30 were completely insensitive to substitution;
an additional 22 showed only partial loss of
function. If the 37% of the residues of lac repressor
represented in the collection of nonsense mutations 
are typical, then these figures can be compared to 
our finding that 89 out of 163 residues in T4 lyso- 
zyme can be successfully substituted by any 
member of the same set of 13 amino acids. 

The fraction of substitutions in T4 lysozyme that 
are functional is a quantity that can be varied 
almost arbitrarily in the system described here. In 
constructing the hybrid phage used for these 
studies, we found that the level of expression of the 
wild-type lysozyme gene could be varied, by genetic 
engineering, over a nearly 1000-fold range without 
loss of plaque-forming ability (Knight et al., 1987;
Hardy & Poteete, 1991). A level near the middle of
this range, comparable to the amount of P22 lysozyme
made by wild-type P22, was chosen. It should
be noted that, in a hybrid phage that makes less
lysozyme, more substitutions are scored as deleterious;
in one that makes more lysozyme, fewer
substitutions are apparently deleterious (not
shown).
(c) Buried residues are sensitive to substitution



In Figure 3, positions in T4 lysozyme that are 
sensitive to substitution are indicated, as well as 
positions occupied by residues with solvent- 
inaccessible side-chains. The striking correlation of 
these properties indicates that interior positions are, 
in general, more sensitive to substitution than 
surface residues. Two conspicuous exceptions to this 
generalization are Glu11 and Asp20, both of which
are thought to be directly involved in catalysis
(Anderson et al., 1981; Anand et al., 1988; Hardy &
Poteete, 1991). These two residues are solvent-accessible,
but not generally replaceable. The
interior location of residues that are sensitive to
substitution is illustrated in Figure 4.

Perutz et al. (1965) observed that surface residues
exhibited little conservation among naturally occur- 
ring globins with highly conserved three- 
dimensional structures. In general, they could be 
substituted by both polar and non-polar residues; 
polar residues, though, were excluded from interior 
positions. Sauer and co-workers, using combina- 
torial cassette mutagenesis, have found that posi- 
tions on the surface, but not in the interior, of the 
amino-terminal domain of λ repressor, could
tolerate substitution with hydrophilic residues (for a 
Figure 3. Sensitivity to substitution versus solvent inaccessibility. Positions of residues in the lysozyme sequence are
indicated along the line by numbers. For each position, the height of the bar above the line is proportional to the number
of deleterious substitutions found (Table 1), ranging from 2 to 12 (positions at which only 1 deleterious substitution
occurs are not shown). Those residues with side-chain solvent accessibilities of less than 12% are indicated by bars below
the line. (Solvent accessibility here refers to the calculated surface area that can be contacted by a sphere of radius 1.5 Å,
expressed as a percentage of the accessible surface area of the same residue in the unfolded state: S. Dao-Pin & 
L. Weaver, personal communication.) 



review, see Bowie et al., 1990). In Table 2, each
amino acid nominally inserted by a suppressor is
scored according to the efficiency with which it
replaces solvent-accessible and buried residues. As
can be seen, all residues function nearly equally well 
in solvent-accessible positions, while Glu, Pro, Lys 
and Arg stand out as not being tolerated in buried 
positions. The conclusion that emerges from this 
analysis is that charged residues and proline are 
generally not acceptable in interior positions in lyso- 
zyme. The possibly surprising observation is that 
non-charged residues of high or moderate hydrophobicity,
Gln in particular, generally are acceptable;
this contrasts with what has been found for λ
repressor. One possible explanation is that a substitution
that buries a non-charged but polar residue
in the hydrophobic core may, indeed, destabilize
lysozyme, but not enough to be scored as deleterious 
in this system. That mutations must be highly 
deleterious to be scored as such in this system is 
clear from comparison of the data in Table 1 with 
accounts of detailed studies of mutant T4 lysozymes 
by others (see below). 



(d) Temperature sensitivity

Of the 2015 substitutions indicated in Table 1, 
261 exhibit qualitatively better function at 25 °C 
than at 37 °C. In 96 cases, this difference results in 
a crossing of the boundary between + and ±,
or a classic temperature-sensitive phenotype.
Temperature sensitivity of a mutant protein
relative to wild-type is generally regarded as an
indication of structural destabilization. Alber et al.
(1987), in a study of 25 mutant T4 lysozymes, found
that temperature-sensitive mutations occurred at
sites with low mobility and low solvent accessibility. 
A striking correlation of the sites of the mutations 
with low values of average side-chain thermal 
factors in the refined structure model was noted. 
Relating the set of mutant proteins studied by these 
investigators to the entire set in Table 1 is compli- 
cated: the latter contains many of the former but, in 
addition, contains others with far more drastic 
effects on the protein, probably including many 
mutations that so destabilize lysozyme that it is not 
functional at any physiological temperature.
Figure 4. Positions of critical residues in T4 lysozyme. Atoms in white are those of residues whose replacement (by at
least 2 others) leads to loss of function. The Figure was generated with Promodeler molecular graphics software (New
England BioGraphics), using co-ordinates from the Brookhaven Protein Data Bank. Two views, rotated by 180°, are
shown. 



However, one prediction related to the findings of 
Alber et al. is that temperature-sensitive mutations
would not be found at positions with high mobility. 
This is, indeed, the case, as shown in Figure 5, in 
which the tendency of substitutions at each position 
to exhibit temperature sensitivity is plotted as a 
function of average side-chain thermal factors. A 
statistical analysis of these data rejects the null 
hypothesis: that side-chain mobility is unrelated to 
the potential temperature sensitivity of its substitutions.
Among 163 residues, temperature-sensitive
substitutions are found in 42, or ~26%. The 37
residues with the highest thermal factors are not
represented among these 42. The probability of
picking, at random, a collection of 37 residues with
no temperature-sensitive substitutions, would be
~0.000016.
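
The quoted probability can be reproduced under a simple assumption. The sketch below is not the authors' stated calculation; it assumes that each of the 37 picks independently misses the 42 temperature-sensitive residues with probability 121/163, which gives a value of about 1.6 × 10⁻⁵. For comparison it also computes the hypergeometric probability for picking 37 distinct residues without replacement, which is smaller. All variable names are illustrative.

```python
from math import comb

total_residues = 163
ts_residues = 42     # residues with at least one temperature-sensitive substitution
sample_size = 37     # residues with the highest thermal factors
non_ts = total_residues - ts_residues

# Independent draws (sampling with replacement): each draw misses the
# temperature-sensitive residues with probability 121/163.
p_independent = (non_ts / total_residues) ** sample_size

# Draws without replacement (hypergeometric): all 37 distinct residues
# must come from the 121 residues lacking ts substitutions.
p_hypergeometric = comb(non_ts, sample_size) / comb(total_residues, sample_size)

print(f"{p_independent:.1e}")
print(f"{p_hypergeometric:.1e}")
```

Either way, the chance of the observed pattern arising at random is very small, in line with the conclusion drawn in the text.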

The data in Table 1 also indicate 202 substitu- 
tions with better scores at 37 °C than at 25 °C, 
including 60 that cross the boundary between + 
and ± . It is uncertain whether any of these repre- 
sent genuine cold-sensitive proteins. More than half 
of them occur on the Cys-inserting amber 
suppressor strain. The Cys suppressor strain grows 
well at 25 °C, but does not suppress amber
mutations well at this temperature, including ones
at the two Cys codons in the lysozyme gene. In
addition, the Tyr suppressor strain is itself mildly
cold-sensitive; it grows poorly at 25 °C. Moreover, all
the apparently cold-sensitive phenotypes are cases
in which the scores are + at 37 °C and ± at 25 °C;
no greater differences are seen. Finally, the hybrid
phages lack P22 gene 15. Although this gene is not
essential, its small contribution to P22's plaque-forming
ability is relatively greater at low temperatures
(Casjens et al., 1989); thus, we might expect
that the effect of a mild mutation in the lysozyme 
gene of P22 e416 would be slightly exacerbated at 
25 °C relative to 37 °C, unless the mutant protein 
itself were stabilized by the lower temperature. 
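
The boundary-crossing criterion used in this section can be written out explicitly. The following is a minimal sketch, assuming the convention given earlier (++ and + positive, ± and - negative); the function name and the example score pairs are hypothetical and are not taken from Table 1.

```python
POSITIVE = {"++", "+"}   # scores counted as positive in the text
NEGATIVE = {"±", "-"}    # scores counted as negative in the text

def classify(score_37, score_25):
    """Label a substitution by whether it crosses the +/± boundary."""
    if score_25 in POSITIVE and score_37 in NEGATIVE:
        return "temperature-sensitive"
    if score_37 in POSITIVE and score_25 in NEGATIVE:
        return "apparently cold-sensitive"
    return "no boundary crossing"

# Hypothetical score pairs (37 °C, 25 °C), for illustration only.
for pair in [("±", "+"), ("+", "±"), ("++", "+")]:
    print(pair, classify(*pair))
```

A pair such as (++, +) differs between temperatures but does not cross the boundary, which is why the text distinguishes substitutions with merely better scores at one temperature from those with a classic temperature-sensitive phenotype.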

(e) Proline-sensitive α-helix

The inability of proline residues to adopt many of 
the backbone conformations available to amino acid
residues might impose severe limitations on the
number of positions in a protein that it can occupy.
In particular, proline residues cannot fit into an
α-helix without distorting it, as well as destabilizing
it due to its lack of a backbone amide for hydrogen
Table 2

Average suppression scores

                                          Positions in lysozyme
Suppressors                            Solvent-exposed    Buried

Hydrophobic†
  Ala                                        2.9            2.8
  Cys                                        2.5            1.8
  Leu                                        2.9            2.3
  Phe                                        2.7            2.0
Moderately polar
  Ser                                        2.9            2.5
  His                                        2.9            1.8
  Tyr                                        2.8            1.9
Very polar
  Glu                                        2.7            1.0
  Gln                                        2.9            2.2
  Lys                                        2.4            1.0
  Arg                                        2.7            0.9
Exceptional backbone torsional properties
  Gly                                        2.9            2.4
  Pro                                        2.4            1.1

Data in Table 1 are converted to numbers (++ = 3, + = 2,
± = 1, - = 0), and averages are calculated for different subsets
of residues, defined as for Fig. 2.

† Residues are assigned to groups according to Rose et al.
(1985), except that Gly and Pro are placed in a separate group.
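
The conversion described in the footnote to Table 2 can be sketched as follows. This is an illustration only: the score lists are hypothetical stand-ins for one suppressor's entries in Table 1, and the names `SCORE_VALUES` and `average_score` are assumptions introduced here.

```python
from statistics import mean

# Numeric values assigned to plating scores, as given in the table footnote.
SCORE_VALUES = {"++": 3, "+": 2, "±": 1, "-": 0}

def average_score(scores):
    """Average numeric value of a list of plating scores."""
    return mean(SCORE_VALUES[s] for s in scores)

# Hypothetical scores for one suppressor at four exposed and four buried positions.
example = {
    "exposed": ["++", "++", "+", "++"],
    "buried": ["++", "±", "-", "+"],
}

averages = {subset: average_score(scores) for subset, scores in example.items()}
print(averages)  # {'exposed': 2.75, 'buried': 1.5}
```

Applied to the full data, one such pair of averages per suppressor would correspond to one row of the table.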



bonding. Indeed, the data in Table 1 show that
proline is the second most frequently unacceptable
residue; 53 proline substitutions are deleterious,
compared with 58 lysine substitutions, the most
frequently unacceptable. If only the strongest
defects, those resulting in a (-) score, are considered,
proline is the most frequently unacceptable
residue. In contrast, only four alanine substitutions
are deleterious. The pattern of proline-sensitive
positions in the polypeptide does not particularly
reflect the locations of α-helices, however. T4 lysozyme
contains nine α-helices, comprising residues 3
to 10, 39 to 49, 60 to 79, 82 to 90, 95 to 106, 115 to
122, 126 to 134, 137 to 141 and 143 to 155
(Remington et al., 1978). Of these, only one, 95 to
106, shows a marked sensitivity to proline substitutions
(except in the first turn, where proline residues
are often found in α-helices). This particular α-helix
is also unique in that a large part of it is completely
internal, running through the core of the carboxy-terminal
domain; the others, for the most part, lie
on the surface of the molecule. Proline substitutions
in this α-helix may have more profoundly destabilizing
effects than usual because of its interior location.
This hypothesis is supported by the
dependence upon solvent inaccessibility of any
α-helical residue's sensitivity to proline substitution.
Of the 26 α-helical residues which are >95% buried,
19 are sensitive. This contrasts sharply with the
observation that only two of the 37 α-helical
residues which are <50% buried are sensitive. This
greater relative sensitivity of buried α-helical
residues suggests that distortion of the helical
Figure 5. Temperature sensitivity (ts) versus side-chain
mobility in the folded structure. The number of temperature-sensitive
substitutions (those resulting in scores of
++ or + at 25 °C, ± or - at 37 °C in Table 1) that occur
at each residue is plotted as a function of the average
thermal factor (B) of its side-chain atoms (not including
α-carbon atoms, except in the case of Gly residues).



conformation by proline substitution is, alone, 
insufficient to inactivate the protein; in most cases, 
further destabilization by the loss of a hydrogen 
bond donor for a buried backbone carbonyl is 
required. This idea is supported by consideration of 
those seven exceptional α-helical residues which are
>95% buried but not sensitive to proline substitution.
In each case the peptide nitrogen is bonded to
a carbonyl oxygen that is capable of forming a 
hydrogen bond with an alternative donor, usually 
solvent water. 
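
The contrast between buried and exposed helical residues (19 of 26 sensitive versus 2 of 37) can be given a rough statistical footing. The sketch below applies a one-sided Fisher exact test to those counts. This test is an illustration added here, not an analysis reported by the authors, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(total, row1)

# Counts from the text: rows are >95% buried / <50% buried helical residues,
# columns are proline-sensitive / not proline-sensitive.
p_value = fisher_one_sided(19, 26 - 19, 2, 37 - 2)
print(f"{p_value:.1e}")
```

With these counts the one-sided p-value is far below conventional significance thresholds, consistent with the sharp contrast described in the text.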



(f) Critical residues

The data in Table 1 suggest a rough rank order of 
residues by sensitivity to substitution. The 12 positions
with the lowest total suppression scores
(counting ++ as 3, + as 2, ± as 1, and - as 0) are
Glu11 (score: 3), Gly30 (7), Tyr161 (7), Asp10 (8),
Trp138 (8), Val149 (9), Gly28 (10), Ser136 (14),
Thr26 (15), Ala98 (15), Asp20 (16) and Ile58 (18).
The crystal structure of T4 lysozyme (Weaver &
Matthews, 1987), as well as other observations, in
many cases suggest plausible explanations for these
sensitivities. 

Glu11 and Asp20 are assumed to be directly
involved in catalysis on the basis of structural 
homology with their counterparts in hen egg white 
lysozyme, Glu35 and Asp52 (Anderson et al., 1981). 
Anand et al. (1988) have shown that substitutions of 
Gin and Asn, respectively, at these two positions 
lead to loss of enzymatic activity. The extreme 
sensitivity of Glu11 to substitution is fully consistent
with its designation as the key catalytic
residue. Asp20 can be replaced by Glu, as found by
Anand et al. (1988); surprisingly, though, it can
evidently be replaced by either Cys or Ala as well.
We have purified T4 lysozyme bearing an
Asp20→Cys substitution (as the result of the direct
substitution of a Cys codon), and shown that it
retains 80% of the specific activity of wild-type, 
and has, in addition, acquired a new sensitivity to 
thiol-modifying reagents (Hardy & Poteete, 1991). 
Similarly, P22 e416 bearing an Ala sense codon at 
position 20 forms plaques well, indicating that the 
result with the amber suppressor is not misleading. 
Moreover, if the sequence of the φ29 lysozyme is
aligned by homology with T4 lysozyme, it is found 
to have an alanine residue at the position occupied 
by Asp20 in T4 lysozyme (Garvey et al., 1986). 
These results would seem to call into question the 
nature or even existence of a catalytic role for 
Asp20 (Hardy & Poteete, 1991), but further studies 
are needed to clarify this issue. 

T4 lysozyme contains two buried salt bridges, 
between Arg145 and Glu11, as well as Arg148 and
Asp10. All of these residues could conceivably be
important for stabilizing the position of the key
catalytic residue Glu11. Indeed, Asp10 and Arg145
are among the residues most sensitive to substitution,
and Arg148 is sensitive to at least some substitutions.
On the other hand, the data in Table 1 do
not indicate a stringent requirement for salt bridges
per se, because all but Glu11 can be replaced
successfully with uncharged residues (but note that
only in the case of Arg148 do these substitutions
result in no diminution of plaque size). 

The two buried salt bridges are themselves part of 
a larger network of internal interactions among a 
cluster of amino acid side-chains in the carboxy-terminal
domain. One of the key residues in this
cluster is Tyr161. Its hydroxyl makes a hydrogen
bond to the carboxyl group of Asp10, and closely
approaches the side-chain of Val149; its aromatic
ring fits neatly into a pocket created by the side-chains
of Met1 and Met6. The amide nitrogen of
Asn101 makes an additional hydrogen bond with
the carboxyl group of Asp10, and Asn101, in turn,
also closely approaches the side-chain of Val149.
The functional importance of this network is clearly
shown by the data in Table 1, which indicate that
all of these residues (except for Met1, which was not
tested), are not freely replaceable; indeed, positions
161 and 149 are among the most sensitive to substitution
in the molecule.

Among the other positions most sensitive to 
substitution, three, Gly30, Gly28 and Thr26, are 
small residues lining the active site cleft. All three 
can be most effectively replaced with other small 
residues. It seems plausible that the presence of 
bulky side-chains at these positions would interfere
with substrate binding. Consistent with this idea, a
mutant T4 lysozyme bearing a substitution of Gln
for Thr26 is nearly unaltered in thermal stability,
but highly defective in enzymatic activity (Poteete
et al., 1991).

The hydroxyl group of Ser136 is buried, and
hydrogen-bonded to the backbone amide nitrogen of
Tyr139. Substitution by larger residues at this position
is generally not tolerated. In the resulting
mutant proteins, the polypeptide backbone at this
position would most likely bulge out, perhaps making
the molecule unstable or sensitive to proteases
in vivo.

The methyl side-chain of Ala98 fits into a small 
space bounded by a number of backbone atoms, 
particularly closely by those of residues 99, 95 and
149. Substitution of Ala98 by any amino acid 
residue with a larger side-chain might be expected 
to lead to a significant structural alteration because 
of steric clash. Such structural alterations have, in 
fact, been observed in lysozyme bearing an 
Ala98→Val substitution (Alber & Matthews, 1987).
Consistent with this concept, the data in Table 1 
indicate that substitutions by residues larger than 
Cys are not tolerated, with the exception of Pro, 
which is. 

The ring nitrogen of Trp138 is hydrogen-bonded
to the side-chain amide oxygen of Gln105.
Otherwise, there is little to distinguish Trp138 from
any of the other residues of the hydrophobic core of
the molecule, and thus to account for its extraordinary
sensitivity to substitutions, which greatly
exceeds that of the other hydrophobic core residues. 
Only Phe, Tyr and Leu are tolerated at this 
position. 

The data in Table 1 indicate that Ile58 is sensitive 
to substitution. This residue is part of the hydrophobic
core of the N-terminal domain. The pattern
of its sensitivity to replacement by other residues is 
generally consistent with that of other hydrophobic 
core residues, but is exceptionally stringent: the 
phage bearing an amber mutation in codon 58 does 
not form plaques on the leucine-inserting amber 
suppressor strain. Examination of the structure 
does not immediately suggest why this position 
should be sensitive to such a conservative 
substitution. 



(g) Exceptional residues

Figure 3 indicates a number of residues that are
solvent-exposed, but sensitive to several different
substitutions. In addition to the catalytic Glu11 and
Asp20, these include His31, Asp70, Gln105, Phe114
and Thr142. The sensitivities to substitution of
Gln105 and Thr142 are readily understandable in
light of the model of Anderson et al. (1981): both are
likely involved in contacts with substrate.

His31 and Asp70 are salt-bridged to each other.
Anderson et al. (1990) have studied mutant T4
lysozymes in which the His31-Asp70 salt bridge is
disrupted by replacing these residues with Asn, and
concluded that it contributes significantly to the
thermal stability of the protein. Although the data
in Table 1 do not include Asn substitutions, they
indicate that both positions are sensitive to a
number of other substitutions. It would seem
reasonable to attribute the defects of the substituted
proteins to destabilization via disruption of
the salt bridge, except that some of the fully functional
substitutions at these positions involve
residues that cannot participate in salt bridges, or
even hydrogen bonds. Possibly, these residues
contribute to lysozyme function in other ways as
well.

Phe114 sits in a crevice on the surface of the
C-terminal domain, with one face of the aromatic
ring packed against the hydrophobic interior, and
the other fully exposed to solvent, yet it is as
sensitive to substitutions as residues that constitute
the hydrophobic core. Phe114 has, in addition, not
been implicated in substrate binding or catalysis
(Anderson et al., 1981). One hypothesis to account
for its apparently critical role in the protein is that
it normally engages in an aromatic-aromatic interaction,
of the type described by Burley & Petsko
(1985), with Trp138. Understanding of the role of
Phe114 will have to await characterization of the
properties of mutant proteins altered at this
position.

Results summarized in Figure 3 and Table 1
indicate that five relatively solvent-inaccessible
residues are completely insensitive to substitutions:
Cys54, Gly56, Val71, Ala74 and Gly77. The side-chain
of Cys54 projects into the interior of the
amino-terminal domain. Replacing it with large
residues would appear to require some local
rearrangement of the structure; its insensitivity to
substitutions is thus not readily apparent from the
structure. On the other hand, accommodating a
large side-chain in the position of Gly56 would
appear to require only displacing the side-chain of
either Lys16 or Lys43, both of which lie along the
surface of the protein. In the case of Val71, one of
the γ-carbon atoms is exposed to solvent; it may be
possible to extend the longer side-chains of large
residues distally from this position without greatly
distorting the structure. Similarly, the β-carbon of
Ala74 is solvent-accessible from the direction in
which a larger side-chain would extend. Access of
the α-carbon of Gly77 to solvent is blocked only by
the side-chain of Arg80, a residue with high side-chain
mobility lying along the surface of the lysozyme
molecule.

(h) Relationship of these results to previous studies

T4 lysozyme has been the subject of genetic 
studies for many years (see Streisinger et al., 1966);
many mutant lysozymes have been characterized. 
The present systematic study of single amino acid 
substitutions effected by amber suppression thus, 
directly or indirectly, reproduces parts of many 
previous studies. 

The results shown in Table 1 directly reproduce 
those of Tsugita and co-workers (for a review, see 
Tsugita, 1971), who determined the suppression 
patterns of amber mutants affecting codons 126, 138 
and 158 (all normally encoding Trp), as well as 
codons 69, 105, 122, 123 and 141 (all Gln residues)
on amber suppressors inserting Ser, Gln and Tyr.
Although these investigators examined the plating 
of phage T4 on E. coli, while we tested the plating of 
a hybrid P22 phage on S. typhimurium, our 
observations are in agreement with theirs. 

Most of the characterized mutant T4 lysozymes
result from missense mutations, or from combinations
of frameshift mutations that restore the
reading frame but cause deletions, insertions and
substitutions. Characterization of the latter class
was historically important in establishing the
nature of the genetic code (see Streisinger et al.,
1966).

Remington et al. (1978) summarized the phenotypes
of most of the then existing, characterized T4
lysozyme mutants and interpreted them in light of
the crystal structure. Phenotypically mild mutants
of T4 lysozyme bearing multiple substitutions as a
result of compensating frame-shift mutations occur
in stretches of residues that are relatively
insensitive to substitutions, according to the data in
Table 1. These stretches include 2 to 4, 22 to 25, 34
to 40, 73 to 76 and 139 to 140. On the other hand,
more seriously defective mutants of the same derivation
affect residues identified in Table 1 as sensitive
to substitution: Asp20, Trp126 and Trp138. In
further agreement with the results in Table 1, these
authors reported that the substitutions Asn2→Arg,
Thr34→Gln, Tyr88→His and Gln141→Arg all have
relatively small effects on T4 lysozyme activity.
Moreover, they reported that removal of the last
two residues of T4 lysozyme with carboxypeptidase
had little effect on the enzyme; our finding that
amber mutations in the last three codons are non-deleterious
in the non-suppressing host is consistent
with this result.

The lysozyme of the related phage T2 differs in
sequence from that of T4 by three single amino acid
substitutions: Asn40→Ser, Ala41→Val and
Thr151→Ala (Inouye & Tsugita, 1968). The two of
these changes that are represented in Table 1 have
no effect on lysozyme activity, as would be
expected. The third, Ala41→Val, has been shown
to increase the thermal stability of T4 lysozyme
(Dao-Pin et al., 1990).

Perry & Wetzel (1985) described a mutant T4
lysozyme, Ile3→Cys, in which a novel disulfide bond
was formed between Cys3 and Cys97. The results
shown in Table 1 suggest that this mutant protein is
active, in agreement with the findings of these
investigators. These investigators also found (Perry
& Wetzel, 1987) that substitutions of Val and Ser
for Cys54 and Cys97, respectively, stabilize the protein
to oxidative damage, and do not harm activity.
The latter mutation is represented in Table 1, and is
scored as functional. Similarly, Matthews and co-workers
have used a double mutant, Cys54→Thr/Cys97→Ala,
as a starting point for studies of
mutant variants, because this "cysteine-less wild-type"
is not as sensitive to oxidation as the wild-type
(Matsumura & Matthews, 1989; Pjura et al.,
1990).

Alber & Matthews (1987) described a set of 21
mutant lysozymes with single amino acid substitutions
as falling into four classes: tight ts, leaky ts,
low activity and heat resistant-low activity. Of
these 21 mutants, 12 are represented in Table 1.
Both members of the tight ts class, Leu66→Pro and
Leu91→Pro, are defective in our system
(ts = temperature-sensitive); the latter exhibits a
particularly pronounced temperature sensitivity.
Among the substitutions classified as leaky ts, eight
are represented in Table 1; only one, Trp126→Arg,
is scored as defective. The single low activity
mutant Glu128→Lys (originally described by
Grutter & Matthews, 1982) is not scored as deleterious
in Table 1, nor is the single heat-resistant low
activity mutant Cys54→Tyr. Alber & Matthews
(1987) additionally described a set of 13 single
amino acid substitutions of Thr157, all of which
mildly destabilize the protein. Table 1 nominally
contains information on nine of these substitutions;
none is defective in function. Alber et al. (1988)
further described a series of ten single amino acid
substitutions of Pro86, all of which had small effects
on stability and activity. Eight of these substitutions
are represented in Table 1; none is defective.
Among the 25 temperature-sensitive substitutions
described by Alber et al. (1987), 14 are represented
in Table 1; five of them (including one not
previously mentioned, Trp138→Gly) are scored as
defective.

Matsumura et al. (1988) described a series of 13
single amino acid substitutions of Ile3, all of which
had mild effects on protein stability. None of the
eight of these substitutions represented in Table 1 is
scored as defective there. Other T4 lysozyme
mutants characterized by Matthews and co-workers
as having little or no effect on activity or stability,
and which are additionally represented in Table 1,
include: Leu133→Phe (Karpusas et al., 1989);
Gly77→Ala and Ala82→Pro (Matthews et al., 1987);
Asn55→Gly and Lys124→Gly (Nicholson et al.,
1989); and Val131→Ala (Dao-Pin et al., 1990). All of
these are scored as fully functional in Table 1.

Overall, the results shown in Table 1 are in good 
agreement with previous studies of T4 lysozyme 
mutants, including a number of unpublished ones 
(S. Dao-Pin, E. Eriksson, A. Morton & B. Matthews, 
personal communications). The best-characterized 
mutant lysozymes generally have minimally altered 
structures. In general, such mutant proteins are not 
scored as defective in Table 1, even though some of 
them have significantly reduced thermal stabilities. 
However, it is reasonable to assume that a mutation 
that reduces the melting temperature of T4 lyso- 
zyme at nearly neutral pH from over 60 °C to 
around 50 °C (for instance) would not inactivate 
lysozyme function at 37 °C. Thus, in general, the 
functional test employed in these studies is strin- 
gent: only strong mutations are scored as 
deleterious. 

We thank Brian Matthews, Larry Weaver, Andrew 
Morton, Cai Zhang, Sun Dao-Pin, Elisabeth Eriksson and 
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T4 lysozyme; and Robert Sauer for comments on the 
manuscript. We thank J. Miller and W. McClain for 
supplying amber suppressor-bearing plasmids. We thank 
Jeff Barbon for technical assistance. This research was 
supported by grant AI24083 from NIH.

Mutagenesis of protein-encoding sequences occurs ubiquitously; it
enables evolution, accumulates during aging, and is associated
with disease. Many biotechnological methods exploit random
mutations to evolve novel proteins. To quantitate protein tolerance
to random change, it is vital to understand the probability
that a random amino acid replacement will lead to a protein's
functional inactivation. We define this probability as the "x factor."
Here, we develop a broadly applicable approach to calculate
x factors and demonstrate this method using the human DNA
repair enzyme 3-methyladenine DNA glycosylase (AAG). Three
gene-wide mutagenesis libraries were created, each with 10^5
diversity and averaging 2.2, 4.6, and 6.2 random amino acid
changes per mutant. After determining the percentage of functional
mutants in each library using high-stringency selection
(>19,000-fold), the x factor was found to be 34% ± 6%. Remarkably,
reanalysis of data from studies of diverse proteins reveals
similar inactivation probabilities. To delineate the nature of tolerated
amino acid substitutions, we sequenced 244 surviving AAG
mutants. The 920 tolerated substitutions were characterized by
substitutability index and mapped onto the AAG primary, secondary,
and known tertiary structures. Evolutionarily conserved residues
show low substitutability indices. In AAG, β strands are on
average less substitutable than α helices; and surface loops that are
not involved in DNA binding are the most substitutable. Our results
are relevant to such diverse topics as applied molecular evolution,
the rate of introduction of deleterious alleles into genomes in
evolutionary history, and organisms' tolerance of mutational burden.

A fundamental aspect of evolution is that mutations generate 
novel alleles that are then favored by selection. However, 
new coding mutations can be deleterious, neutral, or beneficial. 
Mutations can result from environmental and endogenous dam- 
age to DNA and from errors during DNA synthetic processes. In 
humans, random mutations produce inherited diseases and 
accumulate with aging and cancer (1). Conversely, targeted 
hypermutagenesis by immune defenses helps to generate anti- 
body diversity and was recently shown to inactivate retroviral 
genomes (2). John Maynard Smith (3) proposed more than 30 
years ago that the occurrence of functional mutant proteins that 
differ from wild type by one residue is likely frequent for 
evolution to be possible. Since then, numerous evolutionary and 
mutagenesis studies have led to the assertion that proteins are 
highly plastic in tolerating amino acid changes (4, 5). However, 
to date, we lack a quantitative measure of the degree of proteins' 
tolerance for random amino acid changes that occur at a random 
position in the protein. If a rigorous measure of proteins' degree 
of tolerance of random amino acid changes can be defined, then 
such fundamental calculations as the steepness of protein fitness 
landscapes or the rate of introduction of deleterious mutations 
into coding genomes can be more clearly delineated. Further 
understanding of the nature of tolerated amino acid substitu- 
tions can also lend insight into protein folding and design. 

Here, we develop the concept of the probability of inactivating
a protein with a random codon replacement producing amino
acid change at a random location along its sequence. For
conciseness, this concept is named the x factor. We describe an
analytical method for calculating the x factor of proteins from
randomly mutated libraries and demonstrate the method using
the human DNA repair enzyme 3-methyladenine DNA glycosylase
[AAG, methyl purine DNA glycosylase (MPG), and alkyl
purine DNA glycosylase (ANPG)]. Nine hundred and twenty
tolerated amino acid substitutions in active mutant enzymes
were identified and substitutions were mapped to the available
x-ray crystal structure of AAG. We examine the applicability of
the x factor concept to diverse proteins by reanalyzing results
from prior studies. These findings reveal a similar range of
inactivation probabilities.

Materials and Methods 

Escherichia coli strain MV1932 (ada alkA1) was previously
derived from strain AB1157 (6). Chemicals were from Sigma- 
Aldrich, enzymes were from NEB (Beverly, MA), and DNA 
oligonucleotides were purchased from IDT (Coralville, IA), 
unless otherwise indicated. 

Construction of PCR Mutagenesis Libraries. The low, medium, and
highly mutated AAG libraries were generated by using a previously
undescribed PCR mutagenesis protocol that produces
similar mutational frequencies at G:C and A:T base pairs.
Briefly, PCR mutagenesis was carried out sequentially with
Mutazyme in the GeneMorph kit (Stratagene), which preferentially
mutates at G:C sites, and with Taq DNA polymerase with
0.5 mM Mn2+ and dNTP bias, which prefers A:T (7). Libraries
were cloned into a pUC-based plasmid and transformed into
MV1932 for genetic complementation. Mutants from each
library were sequenced before methyl methanesulfonate (MMS)
selection. For detailed PCR mutagenesis and cloning methods,
refer to Supporting Methods and Tables 4-6, which are published
as supporting information on the PNAS web site.

Genetic Selection for Active Enzymes. MV1932 cells were transformed
with pGRFP2-AAG, low, medium, and high libraries
and with empty pGRFP2 vector and grown to confluence, 
diluted 1:100, and grown to midlogarithmic phase in LB- 
carbenicillin (LB-carb) at 37°C. Cultures were treated with 0.2% 
MMS for 1 hr, and the drug was washed away. Pretreated and 
posttreated cultures were serially diluted and plated on LB-carb 
in triplicate to calculate survival means and standard deviations. 
The fractions of surviving clones in libraries were normalized to 
wild-type survival. The dose of MMS used was within a range of 
drug in which the library and control populations were propor- 
tionally affected. MMS sensitivity assays were also performed at 
0.15% and 0.25% MMS; and the percent library survivals 
relative to controls and each other were similar at these MMS 
concentrations (data not shown). 

Supporting Methods present methods used for (i) PCR mutagenesis,
(ii) DNA sequencing, (iii) AAG protein activity assay, and (iv)
x factor calculation and protein substitutability visualization.

Results and Discussion 

Calculating Protein Tolerance to Random Amino Acid Substitutions. 

The probability of protein inactivation with one random amino
acid substitution, the x factor (x_sub), can be calculated from the



Abbreviations: AAG, human 3-methyladenine DNA glycosylase; MMS, methyl methanesulfonate.

Table 1. Calculating the x factor

% of library with (n) number of amino acid changes (f_n × 100)

Library         0     1     2     3     4     5     6     7     8     9    10    11    12
WT-AAG        100     -     -     -     -     -     -     -     -     -     -     -     -
Low             5    20    40    20    15     0     0     0     0     0     0     0     0
Medium        5.6   5.6   5.6  11.1   5.6  33.3  22.2   5.6     0   5.6     0     0     0
High            0   3.6   7.1  10.7   3.6  10.7  14.3  10.7  17.9  14.3   3.6   3.6     0
Vector only     -     -     -     -     -     -     -     -     -     -     -     -     -

Library         Average mutation   Library      % Indels    % Survival          x factor
                frequency          size         (i × 100)   (S × 100)           (x_sub)
WT-AAG          -                  -            -           100 ± 8.6           -
Low             2.2                2 × 10^5     6.1         32.7 ± 3.5          0.39 ± 0.038
Medium          4.6                1 × 10^5     9.9         18.2 ± 3.3          0.30 ± 0.052
High            6.2                0.9 × 10^5   5.5         10.7 ± 2.3          0.33 ± 0.034
Vector only     -                  -            -           0.0051 ± 0.00017    -
Average x factor                                                                0.34 ± 0.06

Distribution of amino acid mutation load frequencies (f_n), library survival (S), and x factors (x_sub) in the low, medium, and highly mutated libraries. Indels (i)
are expected to produce nearly 100% inactivation and are thus subtracted from the unadjusted x factors (x_T) to yield x factors due to amino acid substitutions
(x_sub). As expected, increasing average mutation load results in lower percentage of active enzymes.



fractions of mutants (amino acid mutation load frequencies, f_n)
with (n) number of amino acid changes within a gene-wide
randomly mutated library, and from the proportion of mutants
that survive functional selection (S). For example, f_0 denotes the
fraction of the unselected library with 0-aa change, f_1 denotes the
fraction with 1-aa change, and so on:

$$f_0(1 - x_T)^0 + f_1(1 - x_T)^1 + f_2(1 - x_T)^2 + f_n(1 - x_T)^n + \cdots = \sum_{n=0}^{\infty} f_n (1 - x_T)^n = S \qquad [1]$$

where x_T is the total protein inactivation probability with random
amino acid change, including frameshifts (indels, i). x_T can be solved
after experimental determination of the f_n, S, and i values. Indels
are found at low percentages in random mutagenesis libraries, but
invariably produce protein inactivation. To determine the true x
factor (x_sub) resulting only from a random codon substitution
(missense or nonsense mutation), the indel fraction in the total
mutational pool (i) is subtracted from x_T to obtain x_sub.

$$x_{sub} = x_T - i \qquad [2]$$
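
As a minimal sketch of how Eqs. 1 and 2 can be applied, the following Python snippet solves Eq. 1 numerically for x_T by bisection and then subtracts the indel fraction. The function names, the choice of bisection, and the tolerance are illustrative assumptions, not the authors' procedure (which is described in their Supporting Methods); the inputs are the values listed for the low library in Table 1.

```python
def survival(f, x_t):
    """Left-hand side of Eq. 1: sum over n of f_n * (1 - x_T)**n."""
    return sum(f_n * (1.0 - x_t) ** n for n, f_n in enumerate(f))


def solve_x_t(f, s, tol=1e-9):
    """Find x_T in [0, 1] with survival(f, x_T) == s, by bisection.

    survival() does not increase as x_T grows, so bisection is safe.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if survival(f, mid) > s:
            lo = mid  # predicted survival too high: x_T must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


def x_factor(f, s, indel_fraction):
    """Eq. 2: x_sub = x_T - i."""
    return solve_x_t(f, s) - indel_fraction


# Low library (Table 1): f_0..f_4, S = 0.327, i = 0.061
f_low = [0.05, 0.20, 0.40, 0.20, 0.15]
x_sub_low = x_factor(f_low, s=0.327, indel_fraction=0.061)
print(round(x_sub_low, 2))  # prints 0.39
```

With these inputs the sketch reproduces the 0.39 reported for the low library; the medium and high libraries can be checked the same way by passing their f_n, S, and i values.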

To measure the probability of inactivation by random amino acid
substitutions, we used the gene encoding the human AAG. AAG
protects cells against DNA alkylation damage by excising alkylated
base lesions including 3-methyladenine, 7-methylguanine, and
1,N6-ethenoadenine (εA) (8). The 894-bp AAG cDNA encodes a 298-aa
33-kDa monomeric protein that complements the DNA alkylation
repair-deficient strain MV1932 (ada alkA1) (6) against toxicity
induced by the alkylating drug MMS (9). Under MMS challenge,
MV1932 cells expressing AAG from our pUC-based vector exhibit
>19,000-fold survival advantage over non-AAG-expressing
MV1932 controls (Table 1), thus providing a stringent and specific
selection for active mutant AAG enzymes.
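
The quoted selection stringency follows from the survival percentages in Table 1. A small Python check, using only those two Table 1 values (the variable names are illustrative):

```python
# Percent survival after MMS treatment, from Table 1
wt_survival_pct = 100.0        # cells expressing wild-type AAG
vector_survival_pct = 0.0051   # cells carrying the empty vector

fold_advantage = wt_survival_pct / vector_survival_pct
print(round(fold_advantage))   # prints 19608, i.e. >19,000-fold
```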

The crystal structure of the catalytically competent ΔN79
(residues 80-298) AAG protein complexed with a 1,N6-ethenoadenine
substrate oligo reveals that the enzyme binds to
DNA via a flat positively charged face. A β-hairpin extends into
the DNA minor groove and flips the targeted nucleotide into the
enzyme active site (10, 11). A water molecule is deprotonated by
Glu-125 to form a hydroxyl nucleophile that cleaves the glycosylic
bond between the damaged base and the sugar. The
resulting abasic site is later cleaved and replaced with a normal
nucleotide by the subsequent actions of an endonuclease, a DNA
polymerase, and a DNA ligase (8).

We used PCR mutagenesis to generate low, medium, and
highly mutated AAG cDNA libraries averaging 2.2-, 4.6-, and
6.2-aa changes per gene (a change is defined as a missense,
nonsense, or indel). Sequencing of 20, 18, and 28 mutants from
each unselected library revealed the f_n and i values of each library
(Table 1). Expression of AAG and AAG mutant libraries
protected MV1932 cells against MMS-induced cell death. The
fractional survival of each library relative to wild-type yielded
the S values (Table 1). Solving for x_sub using Eqs. 1 and 2 yielded
the x factors (x_sub) of the low, medium, and high libraries at: 39%
± 4%, 30% ± 5%, and 33% ± 3% (mean ± standard deviation),
respectively. The x factors from the three libraries are within the
95% confidence interval of each other. The average x factor is
34% ± 6%. Thus, the overall probability of inactivating AAG
with a single random amino acid change occurring randomly in
the protein is ≈34%, or one-third (Table 1).

The x Factor and the Substitutability of Proteins. Using three different
libraries, we obtained a consistent value for the probability that
a random amino acid change will inactivate AAG. Our findings raise
the question of whether a similar x factor is seen in other proteins.
It may be argued that the wide range of protein functions should
demand drastically different mutabilities of various proteins. On the
other hand, proteins face essentially similar requirements, such as
the need to properly fold into soluble globular structures necessary
for function (12). General types of changes leading to unfolding
would inactivate various proteins. To address these questions, we
reanalyzed data from diverse published studies and calculated
inactivation probabilities. First, we examined random oligonucleotide
mutagenesis studies in which mutations were targeted to the
catalytic center of enzymes, and from which f_n and S data are
available (13-18). We reasoned that these critical segments are
expected to tolerate few substitutions. The results from human,
bacterial, and viral enzymes are shown in Table 2. Despite the
different enzymes and selection systems used, inactivation probabilities
within these sensitive regions range from 44% to as high as
81%, averaging ≈60%, thus supporting our hypothesis. Second,
Markiewicz and coworkers (19) examined 12 or 13 different amino
acid substitutions at each residue across 90% of the 360-aa E. coli
lac repressor protein using amber codon suppressor strains, which
often corresponded to two or three nucleotide changes per codon.
In our reanalysis of their data, we counted close to 1,380 single
mutants that were inactive, ≈20% of which were temperature
sensitive, of a total of 4,049 examined. This yielded an x factor for
the lac repressor gene of 34%, which correlates well with our results
for human AAG. Third, the x factor of a protein is conceptually
similar to the proportion of new deleterious alleles that arise during
the evolution of the source organism. Eyre-Walker and Keightley
(20) calculated the percentage of deleterious substitution mutations
that were eliminated from the human lineage by purifying selection.
They examined synonymous and nonsynonymous substitution rates
from coding regions of 46 homologous proteins from humans and
chimpanzees. Interestingly, they conclude that at least 38% of
spontaneous mutations in the human lineage were sufficiently
deleterious to have been eliminated by selection (20). Together,
these findings from multiple and independent experimental approaches
suggest a range of similar x factors over the length of
diverse proteins.

Table 2. x values calculated from active-site targeted cassette mutagenesis studies

Protein                 Organism            Protein region        Average        Survived       x value   Ref.
                                            (amino acid no.)      substitution   fraction (S)
                                                                  frequency
DNA polymerase η        Homo sapiens        52-73                 3.8            0.02           0.81      13
DNA polymerase I        Thermus aquaticus   605-617 (Motif A)     3.8            0.04           0.59      14
Thymidylate synthase    H. sapiens          196-199, 204-212      4.2            0.1            0.6       15
Reverse transcriptase   HIV                 67-78 (β3-β4 loop)    4.1            0.11           0.59      16
DNA polymerase I        Thermus aquaticus   659-671 (O helix)     2.7            0.11           0.8       17
Thymidine kinase        Herpes Simplex-1    155, 161-165          2.4            0.32           0.44      18

Available f_n and S values were used to derive the inactivation probabilities. Although the various complementation systems may
require differing levels of minimal enzyme activity, nevertheless the x values were >34% due to the concentration of mutations near
the enzyme active sites.

Enzyme inactivation can result from indirect structure dis- 
rupting mutations or from direct alterations of the catalytic 
mechanism. The AAG functional assay is sensitive to both 
modalities. Minimal AAG activity necessary for complementa- 
tion was assessed by measuring initial reaction rates under 
saturating substrate conditions in lysates of 10 random surviving 
clones. The results indicated that ~5-10% of wild-type activity 
is necessary for survival at the MMS dose used (data not shown). 
Hydrophobic/hydrophilic properties appear to be crucial overall 
determinants of protein structure (5, 12). The buried core is 
sensitive to nonhydrophobic changes and those that disrupt 
packing, whereas the solvent-accessible surface is generally more 
tolerant of change. Residue size, charge, hydrogen-bonding 
characteristics, and bond angle flexibility are other folding 
factors that may be perturbed by random substitutions. 

AAG is a simple monomeric protein. Larger proteins with
multiple functional domains and multiple interacting partners may
exhibit more complex inactivation dynamics. The x factor calculation
assumes that the effects of multiple mutations are independent,
in that the effects of mutations on protein function are largely
additive. This is supported by findings on the λ repressor (21).
However, at higher mutational loads effects of mutations may
interact in more complex ways, with increased possibility of compensatory
or synergistic effects. These results with AAG may
slightly underestimate the x factor, because the N-terminal 79 aa of
AAG are not required for enzymatic activity. Tolerated substitutions
are slightly elevated in this N-terminal region. Nevertheless,
protein-folding principles apply, and mutations in this region that
cause overall misfolding or aggregation will produce inactivation.
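
The independence assumption stated above can be made concrete: if each substitution inactivates the protein with probability x, independently of the others, a mutant carrying n substitutions stays active with probability (1 - x)^n, which is the term summed in Eq. 1. A minimal Python sketch (the function name is illustrative, and x = 0.34 is the average value reported for AAG):

```python
def expected_active_fraction(x, n):
    """Probability that a mutant with n substitutions remains active,
    assuming each substitution independently inactivates with probability x."""
    return (1.0 - x) ** n


x = 0.34  # average x factor reported for AAG
for n in range(0, 7):
    # Print the mutation load and the predicted active fraction
    print(n, round(expected_active_fraction(x, n), 3))
```

Compensatory or synergistic interactions at high mutational loads, as the text notes, would make the real active fraction deviate from this simple power law.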

There likely are variations in the substitutability of different
proteins. The hydrophobic core is generally less tolerant of
change than the solvent-accessible exterior (5). Therefore, x
factors may also be influenced by protein sizes and surface-to-volume
ratios. Axe and coworkers found that 5% of single amino
acid substitutions lead to an inactivated barnase enzyme (22).
Rennell et al. (23) found that ~16% of amino acid substitutions
in T4 lysozyme caused inactivation. The differences from the
above findings may be attributed to the small sizes of barnase and
T4 lysozyme, which are 110 and 164 aa, respectively. Highly
conserved proteins such as histones are likely to be relatively
intolerant to mutation, whereas protein domains such as Fv
regions of antibodies may exhibit increased tolerance against
misfolding. Residues that are posttranslationally modified are
also expected to be intolerant of change.

The x factor is calculated for amino acid replacements and can
include the generation of stop codons. The frequencies of stop
codons in the low, medium, and high libraries are 4%, 9%, and
7.5%, respectively. The x factor can be converted for single-nucleotide
substitutions. Largely due to degeneracy at the third
position, the nucleotide x factor is expected to be less than the
amino acid x factor. Multiplying the amino acid x factor of ≈34%
by the probability of nonsynonymous codon change accessible by
one nucleotide (415/549) yields the nucleotide x factor of ≈26%.
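
The 415/549 ratio can be checked by enumeration. The sketch below assumes (our reading; the text does not spell it out) that the ratio counts all single-nucleotide changes starting from the 61 sense codons of the standard genetic code, with missense and nonsense changes both treated as nonsynonymous. Variable names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

total = 0
nonsynonymous = 0
for codon, aa in CODON_TABLE.items():
    if aa == "*":
        continue  # start only from the 61 sense codons
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            total += 1
            if CODON_TABLE[mutant] != aa:
                nonsynonymous += 1  # missense or nonsense change

print(nonsynonymous, total)            # prints 415 549
nucleotide_x = 0.34 * nonsynonymous / total
print(round(nucleotide_x, 2))          # prints 0.26
```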

Our mutagenesis scheme of creating predominantly random 
single nucleotide substitutions mimics the generation of natural 
diversity. Three naturally occurring human single-nucleotide 
polymorphisms arose in our database of tolerated AAG substi- 
tutions: P64L, T199A, and A258V. These variations did not 
exhibit appreciable effects on MV1932 complementation when 
individually assayed (data not shown). 

Substitutability and Structure. Previously, we have focused on the 
probability of amino acid changes being inactivating. We have 
also examined situations in which amino acid substitutions are 
tolerated. To analyze the nature of tolerated substitutions, we 
sequenced 244 mutant AAG cDNAs from the highly mutated 
library that complemented MV1932. This yielded a total of 920 
tolerated amino acid changes. Fig. 1 maps the mutations along 
the AAG primary sequence. The types of tolerated amino acid 
substitutions at each position are indicated. Residues without 
bars reflect zero identified substitutions. 

A residue's "substitutability index" is defined as the percent
sequenced clones with a substitution at that residue. Many
positions that are evolutionarily conserved are also essential for
activity (10, 11) and did not tolerate changes in our assay.
Examples include Glu-125, Arg-182, and Val-262, each of which
interacts with the activated water molecule that hydrolyzes the
sugar-base glycosylic bond. Other nonsubstituted amino acid
residues include Tyr-162, which projects from a surface β hairpin
and acts as a "nucleotide flipper." Met-164 and Tyr-165 assist in
this base-flipping mechanism by destabilizing the base pair
adjacent to the flipped nucleotide. Y162A, M164A, and Y165A
single substitution mutants were generated by Lau et al. (11) and
assayed by using a genetic complementation system. The Y162A
mutant exhibited large impairment of glycosylase activity,
whereas M164A and Y165A showed only moderate impairment
(11). Correspondingly, in our study, no substitutions were observed
at Tyr-162, whereas positions Met-164 and Tyr-165
showed moderate substitutability, allowing Ile, Arg, and Phe
substitutions, respectively (Fig. 1). Within the substrate-binding
pocket, the flipped-out base stacks between the aromatic side-chains
of Tyr-127, His-136, and Tyr-159. Y127F, H136Q, and
Y159F mutants were also generated previously (11). Y127F
exhibited the most profound decrease in activity, whereas Y159F
was the least affected (11). In our data set, Tyr-127 was
concordantly unsubstituted, and His-136 tolerated only one Tyr
replacement. Tyr-159 was substituted by both Phe and Asn.
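
The substitutability index defined at the start of this paragraph is a simple per-residue percentage. A minimal Python sketch of the calculation follows; the function name, the "P64L"-style input format, and the four toy clones are hypothetical and are not data from the study (the substitution labels are borrowed from the text purely for illustration).

```python
import re
from collections import Counter


def substitutability_index(clones):
    """Percent of sequenced clones carrying a substitution at each residue.

    `clones` is a list of clones, each a list of substitutions written
    like "P64L" (wild-type residue, position, replacement).
    """
    counts = Counter()
    for subs in clones:
        # A clone is counted at most once per residue position.
        positions = {int(re.fullmatch(r"[A-Z](\d+)[A-Z*]", s).group(1))
                     for s in subs}
        counts.update(positions)
    return {pos: 100.0 * k / len(clones) for pos, k in sorted(counts.items())}


# Hypothetical toy data set of four sequenced clones
toy_clones = [["P64L", "T199A"], ["P64L"], ["A258V", "Y159F"], []]
index = substitutability_index(toy_clones)
print(index)  # prints {64: 50.0, 159: 25.0, 199: 25.0, 258: 25.0}
```

Residues absent from the returned dictionary have an index of zero, corresponding to the positions without bars in Fig. 1.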

There are positions in AAG that are not evolutionarily conserved 
but did not exhibit any tolerated changes. The individual spatial 
arrangements of these interactions are likely unique to AAG.

Fig. 1. Tolerated amino acid changes along the AAG primary sequence, shown with evolutionary conservation and secondary structures. Two hundred
forty-four active AAG mutants were sequenced, and observed amino acid substitutions are shown above the wild-type sequence. Colored bars indicate general
categories of amino acids. Numbers are read vertically and indicate residue position. Below the wild-type sequence, evolutionarily invariant residues are marked
by black boxes and conserved residues by gray boxes. α helices (helix), β strands (arrows), and disordered regions (dashed lines) are indicated. Homologous
sequences (from human, mouse, rat, Borrelia burgdorferi, Bacillus subtilis, Arabidopsis thaliana, and Mycobacterium tuberculosis) were identified with PSI-BLAST
(10), and secondary structure calling was performed with Molecular Operating Environment (MOE, CCG, Montreal, Canada).



Although some of these positions may display substitutions if even 
more mutants are sequenced, the structural basis for lack of 
substitutions at many of these positions highlights three general 
mechanisms: specific hydrogen-bonding interactions, unique hy- 
drophobic packing, and ion binding. For example, specific hydro- 
gen-bonding requirements are emphasized by Glu-116's interaction 
with Arg-118, which, in turn, interacts with Glu-188 and Glu-245 in 
a three-way interaction. Arg-261 provides a hydrogen pair partner 
to the evolutionary conserved and unsubstituted Asp-132. This pair 
packs adjacent to Tyr-127, which forms part of the active site 
pocket. Hydrophobic packing constraints are observed at Gly-119, 
which is at the core of a β strand, <4.5 Å away from Leu-184. No 
other side chains can fit in this tight space. Similar packing 
constraints are observed at Leu-184, which is <4.5 Å from the 
unsubstituted Leu-225. Cys-167 is buried <4.5 Å from Ile-227 and 
close to the Cα of Cys-222. Mutations of buried residues may 
require concomitant mutations of other closely packed residues to 
maintain optimal packing. Interestingly, at least one mutant in our 
study appears to demonstrate this principle. It contains I170V and 
L181M substitutions that pack adjacently in the hydrophobic core. 
The conversion of Leu-181 to the slightly bulkier methionine is 
found to coexist with the conversion of Ile-170 to the smaller valine. 



Last, lack of substitutions at Ser-171 highlights the role of ion 
binding. Ser-171's side-chain oxygen binds to a Na+ ion, which has 
been postulated to enhance the structural stability of the active site 
floor (11). 

In contrast, certain regions in AAG appear highly substitut- 
able. Examples include the first 79 N-terminal residues that have 
been shown previously to be unnecessary for in vitro enzyme 
activity and DNA-binding specificity (24). Residues 80-81, 
200-207, 249-254, and 296-298 are also highly substitutable 
(Fig. 1). In accord, they display low electron density in x-ray 
crystallography and were inferred to be disordered loops (10). 

In Fig. 2, the relative substitutability indices of residues are 
mapped onto the available crystal structures of the N79A AAG 
mutant. Dark-blue residues are the least substitutable, and red 
residues are the most tolerant of change. Fig. 2 A and B shows 
surface residues, and Fig. 2 C and D facilitates views into the protein 
core. One striking feature is the general immutability of the 
DNA-interacting face and, specifically, the nucleotide-flipper 
Tyr-162 (Fig. 2A). A surface region distant from the DNA-binding face 
(Fig. 2B) was also observed to have low substitutability scores; 
Glu-188, Arg-118, Glu-245, Glu-116, and Arg-110 participate in a 
network of charged contacts that likely contribute to protein 
[Fig. 2 color scale, from Least Mutable (dark blue) to Most Mutable (red)] 

Fig. 2. Substitutability of AAG amino acid residues and structure. Individual residues' substitutability scores are indicated by their color in the spectrum, with 
red being the most substitutable and dark blue the least. (A) The DNA interacting face of AAG. The DNA-binding face and intercalating Tyr-162 (*) are largely 
intolerant of substitution, whereas distant loops are generally tolerant of change. (B) Rotation of A by 180°, showing the opposite side of the AAG surface. (C 
and D) The substitutability of AAG residues shown by secondary and tertiary structure representation. Views are rotated 180° relative to each other. Residues near 
the active site of AAG, adjacent to the extrahelical 1,N6-ethenoadenine DNA lesion, are generally intolerant of change. The β4 (165-171) strand is indicated 
by #. Arrows point toward α helices with solvent accessible faces that exhibit greater substitutability than their buried sides. 



stability. In the protein interior, a conspicuous pattern of alternating 
unsubstituted and substitutable sites is seen in the β4 (165-171) 
strand (Figs. 1 and 2 C and D). Cys-167, Asn-169, and Ser-171 are 
relatively unsubstituted, because their side chains face toward the 
active site and are involved in substrate recognition or Na+ binding 
(11). In contrast, Met-168 and Ile-170 tolerate hydrophobic substitutions, 
because their side chains face the opposite direction and 
pack into the hydrophobic core. Solvent-accessible surfaces generally 
exhibit higher substitutability compared with buried residues. 
This is evident in Fig. 2 C and D, where the exposed exterior sides 
of several α helices exhibit greater substitutability than their 
interior-facing sides. 

Averages of substitutability indices in different structural 
motifs are presented in Table 3. In AAG, evolutionarily con- 
served and catalytically crucial residues are significantly less 
substitutable than the rest of the protein. Nonconserved residues 
adjacent to conserved residues in the primary sequence are 
generally less substitutable than other nonconserved residues, 
reflecting their involvement in functionally important regions. 
This observation suggests they may also be fruitful targets for 



directed evolution studies. β strand residues, as a group, are less 
tolerant of substitution than are α helices. This may be explained 
in part by the fact that in this α-β protein, the β-sheets are 
generally less solvent accessible and therefore possess fewer 
surface residues that are more likely to tolerate substitutions. 
Loops and turns, expectedly, are the most substitutable. 

Some Implications of the x Factor. We observed that various 
residues of a protein are differentially sensitive to substitutions, 
and that tolerance of the entire protein to random change can be 
defined by the x factor. The x factor is a description of an intrinsic 
property of individual proteins and protein motifs and can be a 
guiding parameter in the study of natural and artificial evolutionary 
processes. For example, using the estimated inactivation 
probability of ≈34% and assuming mutually independent effects 
on inactivation probability by multiple mutations, the isolation of 
active mutants harboring many mutations from large random 
mutagenesis libraries (>10^5) is not surprising (25). In contrast, 
a single, non-3-bp indel event almost certainly leads to inactivation 
(x ≈ 1). Therefore, indel frequencies should be minimized 
Table 3. Mean substitutability indices of AAG motifs 





Motif                                 Motif residue       Nonmotif        t test
                                      substitutability    residues
                                      Mean ± SD           Mean ± SD       P value

Entire protein                        1.38 ± 1.10
Evolutionary conserved                0.75 ± 0.85         1.53 ± 1.10     6.37E-08
Nonconserved adjacent to conserved    1.25 ± 0.84         1.60 ± 1.16     0.017
α-helices                             1.19 ± 0.97         1.41 ± 1.12     0.19
β-strands                             0.73 ± 0.76         1.52 ± 1.12     1.01E-08
Turns and loops                       1.57 ± 1.12         0.93 ± 0.89     3.46E-07
Functionally important residues       0.56 ± 0.60         1.41 ± 1.11     1.04E-04



The substitutability index of individual residues (number of observed amino acid changes/number of active 
mutants sequenced at that position) is expressed as a percentage (×100), categorized by motifs, and averaged. 
The t test is performed against indices of motif nonmembers for differences in mean substitutability indices, 
indicative of differing importance to enzyme function. 
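
As a worked illustration of the index defined in the Table 3 footnote, the following sketch computes a per-residue substitutability index and the mean ± SD for a motif and for its nonmembers. The function names, residue positions, and toy counts are hypothetical and are not taken from the AAG data set. The t test is left out because the footnote does not state which variant was used.

```python
from statistics import mean, stdev


def substitutability_index(n_changes, n_mutants_sequenced):
    """Observed amino acid changes per sequenced active mutant, as a percentage."""
    if n_mutants_sequenced <= 0:
        raise ValueError("at least one sequenced mutant is required")
    return 100.0 * n_changes / n_mutants_sequenced


def motif_summary(indices, motif_positions):
    """Return ((mean, SD) for motif residues, (mean, SD) for nonmotif residues).

    indices maps residue position -> substitutability index (%).
    Each group needs at least two residues so that a sample SD exists.
    """
    motif = [v for pos, v in indices.items() if pos in motif_positions]
    rest = [v for pos, v in indices.items() if pos not in motif_positions]
    return (mean(motif), stdev(motif)), (mean(rest), stdev(rest))


# Hypothetical counts of observed changes at four positions (244 mutants sequenced).
toy_counts = {101: 0, 102: 2, 103: 5, 104: 7}
toy_indices = {pos: substitutability_index(n, 244) for pos, n in toy_counts.items()}
(motif_mean, motif_sd), (rest_mean, rest_sd) = motif_summary(toy_indices, {101, 102})
print(f"motif {motif_mean:.2f} ± {motif_sd:.2f}; nonmotif {rest_mean:.2f} ± {rest_sd:.2f}")
```

A lower motif mean than nonmotif mean, as in this toy input, is the pattern Table 3 reports for conserved residues and β-strands.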



in efforts to evolve novel proteins from high mutation load 
libraries. Retroviruses, such as HIV, may be susceptible to 
increased mutational burden, and lethal mutagenesis of viral 
genomes by introducing mutations through the use of nucleoside 
and ribonucleotide analogs has been proposed (26). Given our 
findings, such efforts may be further enhanced by the use of 
analogs that efficiently induce frameshift mutations. Viral ge- 
nomes that encode multiple proteins as different reading frames 
of the same genetic sequence may be particularly sensitive to 
agents that generate frameshifts. 
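
The independence assumption used in the discussion of the x factor can be made concrete with a short sketch. Taking x ≈ 0.34 as the probability that a single random amino acid substitution inactivates the protein, the chance that a mutant carrying n substitutions remains active is (1 − x)^n. The function names and the library size of 100,000 clones are illustrative assumptions, not part of the original analysis.

```python
def active_fraction(n_substitutions, x=0.34):
    """Probability that a mutant with n substitutions stays active,
    assuming each substitution inactivates independently with probability x."""
    if n_substitutions < 0:
        raise ValueError("number of substitutions must be non-negative")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a probability between 0 and 1")
    return (1.0 - x) ** n_substitutions


def expected_active_clones(library_size, n_substitutions, x=0.34):
    """Expected number of active clones in a library where every clone
    carries the same number of substitutions (a simplifying assumption)."""
    return library_size * active_fraction(n_substitutions, x)


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        frac = active_fraction(n)
        clones = expected_active_clones(100_000, n)
        print(f"{n:2d} substitutions: active fraction {frac:.4f}, "
              f"about {clones:.0f} active clones per 100,000")
```

Under these assumptions a library of this size is still expected to contain on the order of a thousand active clones at ten substitutions each, whereas setting x close to 1 (as for a frameshifting indel) drives the active fraction toward zero after a single event.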

It is estimated that the human mutation rate per coding diploid 
genome per generation is 3.2, including base substitutions, indels, 
and larger changes (27). Multiplying this number by the general x 
factor of ≈34%, the rate of introducing deleterious coding alleles 
by random substitution is ≈1.0 per diploid genome per sexual 
generation. This is likely an underestimate, because indels inactivate 
coding regions much more efficiently than base substitution 
mutations. Dominant negative mutations may also more efficiently 
produce a deleterious phenotype, although the frequency of mutations 
that act in a dominant negative manner is largely unknown. 
Interestingly, our deleterious coding allele rate calculation of 1.0 is 
congruent with the estimate of 1.6 independently calculated by 
Eyre-Walker and Keightley (20), which was based on the assumption 
of 60,000 genes in the human genome. 
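
The arithmetic behind this estimate is a single product, sketched here for clarity. The function name is hypothetical; the inputs (3.2 mutations per coding diploid genome per generation and an x factor of 0.34) are the values quoted in the text. The unrounded product is about 1.09, which the text reports as approximately 1.0.

```python
def deleterious_allele_rate(mutations_per_generation, x_factor):
    """Expected inactivating coding mutations per diploid genome per generation,
    treating every coding mutation as a random substitution."""
    return mutations_per_generation * x_factor


rate = deleterious_allele_rate(3.2, 0.34)
print(f"deleterious coding alleles per generation: {rate:.2f}")
```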

Overall, our method of gene-wide random mutagenesis and 
sequencing highlights the relative importance of specific residues 
to enzyme structure and function through the numbers and types 
of tolerated substitutions. This work validates and extends from 
previous structural studies. Interestingly, the substitutability 



indices of individual residues can be obtained independently of 
conservation or structural information and are generally con- 
sistent with both. The extensive database of tolerated amino acid 
substitutions is obtained from a more expedient form of gene- 
wide study than previous techniques, such as alanine scanning. 
This database can provide a valuable resource for predicting the 
effects of mutations on protein function, which has been a focus 
of recent investigations (28, 29). 

We advance the concept of the x factor as a measure of protein 
tolerance to random substitutions. The x factor may also be 
useful in measuring genomic robustness against mutations. It has 
been hypothesized that evolvability, or the ability to generate 
heritable variation, may be favored in certain environments (30). 
Genomes experiencing high mutational burden may face selec- 
tive pressure to evolve proteins that are tolerant of change, in 
which case the observed x factors are expected to be less than x 
factors of homologous proteins from more faithfully propagated 
genomes. It may be of particular interest to examine x factors 
from various protein families and diverse organisms. 
Abstract 

A UV/VIS spectrophotometric microtiter-plate and a filter-paper based assay using 4-(p-nitrobenzyl)pyridine (NBP) were 
developed to determine epoxide hydrolytic activity by measuring the decrease of the epoxide concentration. Both systems 
were applied for screening an expression gene bank of Rhodococcus sp. NCIMB 11216. As a reference, whole cells from 
Rhodococcus sp. NCIMB 11216 and Beauveria sulfurescens ATCC 7159 exhibiting epoxide hydrolase activity were used. The 
microtiter-plate system was also evaluated for different epoxides and performed in a laboratory robotic system for high 
throughput screening. The microtiter-plate assay showed a high sensitivity for the detection of small concentrations of 
epoxides (0.1-1 mg/well) such as styrene oxide, ethyl phenylglycidate, n-hexane oxide and indene oxide. The filter paper 
assay was further optimized for styrene oxide. Both assays were suitable to screen within libraries of epoxide hydrolases 
without interference from other enzymes such as esterases, lipases or proteases. The assay should allow screening of large 
libraries obtained by directed evolution, strain collections and (expression) gene banks for epoxide hydrolytic activity, or 
monitoring of the purification process of an epoxide hydrolase. © 1999 Elsevier Science B.V. All rights reserved. 

Keywords: Colorimetric assay; Epoxide hydrolase; 4-(p-nitrobenzyl)pyridine; High throughput screening 



1. Introduction 

Epoxides are useful chiral compounds for organic 
chemistry and used as valuable intermediates for the 
synthesis of enantiomerically pure pharmaceutically 
active substances. They can be prepared by several 
chemical methods such as the Sharpless epoxidation 
of allylic alcohols [1] or by the method of Jacobsen 
and Katsuki [2]. They can also be obtained enantiomerically 
pure by microbial reduction of α-haloketones, 
by microbial degradation of halohydrins or by 
monooxygenase reactions catalyzed by different P450 
species [3,4]. These epoxides are often cleaved by 
microsomal or cytosolic epoxide hydrolases which 
play an important role in the detoxification of xenobiotics 
[5]. Epoxide hydrolases (EC 3.3.2.3) belong to 
the class of α/β-hydrolase-fold enzymes. The reaction 
mechanism displays similarities to dehalogenases 
[6,7]. Different microbial and fungal epoxide hydrolases 
are known from literature, e.g. from Rhodococcus 
sp. NCIMB 11216 [8], Aspergillus niger LCP 521 [9], 
Beauveria sulfurescens ATCC 7159 [10] and Nocardia 
EH1 [11]. Recently, the first recombinant bacterial 
epoxide hydrolases were cloned from Agrobacterium 
radiobacter AD1 [12] as well as from Corynebacterium 
sp. C12 [13] and the former was expressed in E. 
coli. 

To determine epoxide hydrolase activity and enan- 
tioselectivity in hydrolysis, different methods were 
used such as liquid chromatography (LC) [14], gas 
chromatography (GC) [10] or UV/VIS [15]. LC and 
GC methods are not suitable to screen large mutant 
enzyme libraries obtained by directed evolution via 
error-prone polymerase chain reaction (PCR) [16] or 
gene shuffling [17], within strain collections or 
(expression) gene banks due to complicated sample 
preparation and time consuming analysis. UV/VIS spec- 
trophotometry methods are more convenient because of 
simple and fast measurement which enables screening 
in high throughput systems. For filter paper assays, a 
chromophore indicating the hydrolysis of an epoxide 
in the visible area would be most suitable. Current 
assays determine the hydrolysis of epoxides by UV 
measurement and only the decrease of absorption is 
monitored. Unfortunately, the absorption coefficients 
are usually small [18,19]. Other assay substances such 
as epoxy carbonates, which release chromophores like 
p-nitrophenol, are only suitable in the absence of other 
hydrolases such as esterases or lipases, which can 
cleave the carbonate. As a consequence, purified 
epoxide hydrolases are necessary [15]. 

For screening, engineering and for purification of 
new epoxide hydrolases, a fast and accurate assay is 
required which enables a high throughput. For the 
measurement of alkylating substances, the 4-(p-nitro- 
benzyl)pyridine (NBP) test is well known. It was 
applied to a variety of different alkylating substances 
such as reactive halogen compounds, carboxylic acid 
chlorides, organic phosphoric compounds, aziridines 
and epoxides [20]. The alkylation of NBP with styrene 
oxide analogs in vitro [21] and with chloroethylene 
oxide [22] were studied. A correlation of alkylating 
and mutagenic activities of allylic compounds was 
found using an NBP alkylating procedure [23]. Also, 
the mutagenic potential of slow-reacting epoxides was 
determined using NBP [24]. The NBP test was applied 
for the determination of microsomal epoxide hydrolase 
activity with safrole oxide (SAFO) as substrate 
after extraction/workup of the epoxide from the incubation 
medium [25]. A colorimetric assay and a thin 
layer assay using the NBP test after extraction of the 
reaction mixture with organic solvent and boiling of 
the assay mixture were developed [26]. 

In the present paper, we describe a modified NBP 
test using microtiter-plates together with a pipetting 
robot which can be performed directly with whole or 
lysed cells containing other hydrolytic enzymes with- 
out any extraction or separation steps at mild assay 
conditions in microtiter-plates or by using a filter- 
paper based assay for direct screening on agar plates. 

2. Experimental 

2.1. Reagents 

All epoxides were obtained from Aldrich (Steinheim, 
Germany) at the highest purity available. Indene 
oxide was prepared using N-bromosuccinimide. The 
crude bromohydrin was treated with sodium bicarbonate 
according to literature [10] and the epoxide was 
purified by flash chromatography on silica gel (petroleum 
ether:ethyl acetate = 9:1, Rf = 0.55, yield 61%). 
1H- and 13C-NMR spectra were in agreement with 
literature [10]. All other reagents were purchased from 
Fluka (Buchs, Switzerland) at the highest purity avail- 
able. 

2.2. Apparatus 

UV/VIS spectrophotometry measurements were 
performed on an ICN Titertek MS 212 (ICN, Meckenheim, 
Germany) microtiter-plate reader at 560 nm 
(reference filter 650 nm) using standard 96 well 
microtiter-plates. Activity was calculated from the 
difference in absorbance at 560 and 650 nm. 

GC analysis was performed on a FS-Cyclodex β-I/P 
CS-Fused Silica capillary column (CS-Chromatographic 
Service GmbH, Langerwehe, Germany) using H2 
as carrier gas (split 1:100, 130 kPa) at 75°C for styrene 
oxide and at 140°C for phenyl 1,2-ethanediol. 

In case of the gene bank of Rhodococcus sp. 
NCIMB 11216, assays were done using a Biomek 
2000 Workstation equipped with a Biomek SL side- 
loader with Beckman incubator and the software 
package BioWorks 2.2 from Beckman (München, 
Germany). 

Normal Whatman filter paper (Springfield Mill, 
UK) was used for the agar plate assay. 
2.3. Strains 

Rhodococcus sp. NCIMB 11216 and Beauveria 
sulfurescens ATCC 7159 were cultivated in LB media 
(yeast extract 5 g/l, tryptone 10 g/l, NaCl 10 g/l, pH 
7.0) at 30°C and 210 rpm in 2 l shaking flasks for 2 
days. E. coli DH5α [F− endA1 hsdR17 (rK− mK+) supE44 
thi-1 recA1 gyrA96 relA1 Δ(lacZYA-argF)U169 
φ80dlacZΔM15 λ−] was cultivated in LB media at 
30°C until OD 600 nm reached 0.35. Cells were harvested 
by centrifugation at 4000 rpm at 0°C. Cell lysis 
was performed by addition of 10 ml lysozyme solution 
(0.1 M Tris/HCl buffer, 300 mM NaCl, 1 mg/ml lysozyme) 
at 0°C for 1 h to the remaining cell pellet. 

2.4. Recombinant DNA techniques 

Rhodococcus sp. NCIMB 11216 cells were lysed by 
the addition of lysozyme as described above. Chromosomal 
DNA was partially digested by Sau3A and 
ligated to vector pUC19 that had been digested with 
BamHI and dephosphorylated with alkaline phosphatase 
from calf intestine. E. coli DH5α was transformed 
with the ligation mixture and plated onto LB agar 
plates containing 100 µg/ml ampicillin. Transformants 
were subsequently replica-plated to these 
LB-ampicillin plates, which were used to screen for 
colonies with epoxide hydrolase activity. All operations 
were done using standard procedures [27]. 
Restriction enzymes, alkaline phosphatase and 
pUC19 were obtained from MBI Fermentas (St. Leon-Rot, 
Germany). 

2.5. Assay procedures 

2.5.1. Assay in 96 well microtiter-plates 

For validation, different amounts of styrene oxide 
(0.17 M), phenyl ethylglycidate (0.14 M), indene 
oxide (0.16 M), n-decane oxide (0.1 M) and n-hexane 
oxide (0.8 M) were dissolved in acetone and used as 
stock solutions. Stock solution (40 mg) was added to 
100 µl of LB media. Different amounts of triethylamine 
(0, 18 or 36 mg) or piperidine (0, 22 or 43 mg) 
and 50 mg of triethylene glycol dimethylether were 
added. Then 50 mg of 4-(p-nitrobenzyl)pyridine solu- 
tion (0.23 M NBP in methoxyethanol) was added and 
the reaction mixture was incubated at 39°C. The 
absorbance at 560 nm was measured at different times 



against 650 nm as reference wavelength and the dif- 
ference in absorption was calculated. 

Assay with E. coli DH5α cells: A stock solution of 
styrene oxide (1.3 M in acetone) was prepared and 
diluted with acetone (1.3, 0.65, 0.32, 0.16, and 
0.08 M). These dilutions (40 mg) and 50 µl of whole 
or lysed E. coli DH5α cells were used and the assay 
was performed as described above. 

2.5.2. Assay on filter paper 

The filter papers were preincubated for 10 min in a 
glass petri dish containing a styrene oxide solution 
(90 mM in acetone). After air drying for 20 min, the 
agar plates were replica-plated and the colonies were 
transferred from the agar plates by manually pressing 
the filter paper on the agar. The filter paper was incubated 
for 30 min at 37°C in a closed glass petri dish for 
epoxide hydrolysis. (Care should be taken due to the 
high toxicity of styrene oxide!) For developing, the 
filter paper was incubated in 2 ml of NBP solution 
(0.23 M NBP in methoxyethanol) for 1 min, air dried 
for 30 min and incubated for 30 min at 80°C in a 
closed glass petri dish. Hydrolysis activity could be 
monitored by the formation of a colorless zone on the 
blue filter paper. 

3. Results and discussion 

For the validation of the NBP assay in microtiter-plates, 
different epoxides were used (styrene oxide, 
phenyl ethylglycidate, indene oxide, n-hexane oxide 
and n-decane oxide). The assay showed a high sensitivity 
for all epoxides except for n-decane oxide 
because of the low alkylating strength of this epoxide. 
A linear correlation between the amount of epoxide 
and absorbance was found for all epoxides (data for 
n-hexane oxide not shown), with the exception of 
n-decane oxide (data not shown), in the microtiter-plate 
assay (Figs. 1-3). In the literature, a stabilizing effect 
by the addition of different bases especially piperidine 
to the assay mixture has been described [20]. In 
contrast, we observed that triethylamine or no addition 
of base gave better results. The addition of piperidine 
even led to a decrease in absorbance. 

Although the fastest reaction in the NBP assay was 
observed for indene oxide compared to the other 
epoxides, the assay was further optimized for styrene 
oxide, because an easy evaluation of the results was 
[Fig. 1 plot; x-axis: styrene oxide (mg/well)] 

Fig. 1. Effect of incubation of styrene oxide on absorbance (60 min, 39°C) without base (■), with piperidine (25 µl ▲, 50 µl ●) and with 
triethylamine (25 µl ◆, 50 µl ▼). 



possible by GC analysis. A clear correlation of the 
decrease in styrene oxide concentration as determined 
by the NBP assay with the consumption of styrene 
oxide and concomitant formation of the product phenyl 
1,2-ethanediol as determined by GC analysis was 
observed. 

The absorbance of reaction mixtures without and 
with different concentrations of styrene oxide was 
measured in the presence of E. coli DH5α cells in 
LB media. In the presence of E. coli cells, a linear 
correlation between the amount of styrene oxide and 
the absorbance could be obtained after 45 min 
(Fig. 4). The error introduced by measurement of 
the absorbance after 45 min is within ±3% for three 
control experiments under the same conditions. This 
means that it is not necessary to do any extraction or 
work-up of the incubation mixture to remove the cells. 
The assay was also found to be accurate in the presence 
of lysed E. coli DH5α cells. Even in the presence 
of lysis products and remaining cell debris, a 
proportionality between the amount of styrene oxide 
and absorption was observed (Fig. 4). 
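
The linear relationship between epoxide amount and reference-corrected absorbance lends itself to a simple calibration. The following is a minimal sketch, assuming a straight-line calibration from standards of known styrene oxide content; all function names and the numerical standards are hypothetical illustrations, not measured values from this work.

```python
def corrected_absorbance(a560, a650):
    """Absorbance at 560 nm corrected by the 650 nm reference reading."""
    return a560 - a650


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def epoxide_from_absorbance(delta_a, slope, intercept):
    """Invert the calibration line to estimate epoxide (mg/well)."""
    return (delta_a - intercept) / slope


def fraction_hydrolyzed(initial_mg, remaining_mg):
    """Fraction of the epoxide consumed relative to the amount supplied."""
    return 1.0 - remaining_mg / initial_mg


# Hypothetical standards: mg styrene oxide per well and (A560, A650) readings.
standards_mg = [0.0, 0.25, 0.5, 1.0]
readings = [(0.15, 0.10), (0.40, 0.10), (0.65, 0.10), (1.15, 0.10)]
delta = [corrected_absorbance(a560, a650) for a560, a650 in readings]
slope, intercept = fit_line(standards_mg, delta)

# Hypothetical sample well that started with 0.62 mg styrene oxide.
remaining = epoxide_from_absorbance(corrected_absorbance(0.46, 0.10), slope, intercept)
print(f"remaining {remaining:.2f} mg, hydrolyzed {fraction_hydrolyzed(0.62, remaining):.0%}")
```

A cell-free blank processed in the same plate would normally serve as the reference for the initial amount; that choice is left to the reader here.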
Fig. 2. Influence of reaction time on the hydrolysis of styrene oxide, and hence absorbance at different concentrations of bases (39°C, 0.62 mg 
styrene oxide/well) without base (■), with piperidine (25 µl ▲, 50 µl ●) and with triethylamine (25 µl ◆, 50 µl ▼). 
[Fig. 3 plot; x-axis: epoxide (mg/well)] 

Fig. 3. Absorbance at different concentrations of indene oxide without base (■), with triethylamine (50 µl ●) and ethyl phenylglycidate 
without base (▲) and with triethylamine (25 µl ▼) (60 min incubation, 39°C). 



Assays were also performed using 100 µl of a 
suspension of whole cells from Rhodococcus sp. 
NCIMB 11216 and Beauveria sulfurescens ATCC 
7159 in LB media. After 2 h of incubation at 39°C 
in the presence of different dilutions of styrene 
oxide (same conditions as above), the assay was 
performed as described in Section 2 and showed 
the disappearance of styrene oxide compared with 
a blank without cells. Epoxide hydrolysis activity 
could even be observed in a mixture contain- 
ing 33% or 50% acetone using whole cells of both 
microorganisms. 
Fig. 4. Absorbance in the presence of E. coli DH5α after NBP assay (●), without cells/NBP (■) and after cell lysis/NBP (▲) at different 
concentrations of styrene oxide per well (45 min incubation, 39°C). 



In principle, other enzymatic activities such as 
glutathione transferases or reductases present 
in whole microorganisms or crude lysed cells might 
interfere with the epoxide hydrolase assay based on 
NBP. However, we exclude this because of the 
correlation between results obtained by the NBP 
assay and GC analysis (see above). 

The values obtained from the NBP assay were 
further verified by using whole or lysed cells from 
E. coli DH5α and boiled Rhodococcus sp. cells which 
showed no epoxide hydrolase depletion. Furthermore, 
no spontaneous reaction with nucleophilic compounds 
Fig. 5. Control experiment using whole (top row) and boiled 
(bottom row) cells from Rhodococcus sp. Increasing styrene oxide 
concentrations from left to right. 

(e.g. water, proteins, DNA) in these mixtures under the 
assay conditions and within the reaction time could be 
observed (Fig. 5). 

In order to screen for epoxide hydrolase activity 
directly on agar plates, a filter paper assay was used. 
Here, Rhodococcus sp. NCIMB 11216 or E. coli 
DH5α colonies of a gene bank containing genes from 
the same Rhodococcus sp. were grown on LB agar 
plates. The colonies were then transferred to filter 
paper preincubated with styrene oxide as given in 
Section 2. Epoxide hydrolase activity could easily 
be detected by the formation of colorless halos on 
the remaining blue filter paper caused by the hydro- 
lysis of styrene oxide by the epoxide hydrolase from 
Rhodococcus sp. NCIMB 11216 (Fig. 6). 

4. Conclusions 

Two novel assay formats useful for screening of 
epoxide hydrolase activity in whole microorganisms 
(intact, lysed cells or containing gene banks encoding 
epoxide hydrolase activity) were developed. The 
microtiter-plate assay should also allow rapid 
and quantitative monitoring of the activity of epoxide 




[Fig. 6 panel labels: Control (left), Rhodococcus sp. (right)] 



Fig. 6. Filter paper assay without enzyme as control (left) and with 
styrene oxide after incubation in the presence of whole cells from 
Rhodococcus sp. (right). 
hydrolases during purification procedures. For both 
applications, a fast and accurate procedure is impor- 
tant, because large numbers of samples have to be 
analyzed in reasonable time. The microtiter-plate 
assay is easy to perform with the help of laboratory 
automation techniques and large libraries can be 
screened fast with high accuracy and reproducibility. 
The microtiter-plate assay can be applied to different 
epoxides such as styrene oxide, ethyl phenylglycidate, 
indene oxide or n-hexane oxide representing different 
classes of substituted epoxides. Both assays should 
also be transferable to other epoxides having suffi- 
ciently high alkylating strength. By the use of enan- 
tiomerically pure epoxides, it should also be possible 
to identify enzymes with high enantioselectivity and 
desired enantiopreference. Furthermore, the filter 
paper assay enables a direct screening on agar plates, 
which is especially suitable for the screening of large 
numbers of colonies without transferring them to 
microtiter-plates. 
KLAUS-JÖRG RIEGER 1,2*, MOHAMED EL-ALAMA 2†, GEORG STEIN 3, CHARLES BRADSHAW 2†, 
PIOTR P. SLONIMSKI 1 AND KINSEY MAUNDRELL 2† 

1 Centre de Génétique Moléculaire du Centre National de la Recherche Scientifique, Laboratoire Propre Associé à 
l'Université Pierre et Marie Curie, F-91198 Gif-sur-Yvette, France 

2 Geneva Biomedical Research Institute, Glaxo-Wellcome, 14 chemin des Aulx, CH-1228 Plan-les-Ouates, Geneva, 
Switzerland 

3 Institut für Botanik I, Abt. Botan. Cytologie und Morphologie, Heinrich-Heine-Universität Düsseldorf, 
Universitätsstr. 1, D-40225 Düsseldorf, Germany 



By now, the EUROFAN programme for the functional analysis of genes from the yeast genome has attained its 
cruising speed. Indeed, several hundreds of yeast mutants with no phenotype as tested by growth on standard media 
and no significant sequence similarity to proteins of known function are available through the efforts of various 
laboratories. Based on the methodology initiated during the pilot project on yeast chromosome III (Yeast 13, 
1547-1562, 1997) we adapted it to High Throughput Screening (HTS), using robotics. The first 100 different gene 
deletions from EUROSCARF, constructed in an FY1679 strain background, were run against a collection of about 
300 inhibitors. Many of these inhibitors have not been reported until now to interfere in vivo with growth of 
Saccharomyces cerevisiae. In the present paper we provide a list of novel growth conditions and a compilation of 49 
yeast deletants (from chromosomes II, IV, VII, X, XIV, XV) corresponding to 58% of the analysed genes, with at 
least one clear and stringent phenotype. The majority of these deletants are sensitive to one or two compounds 
(monotropic phenotype) while a distinct subclass of deletants displays a hyper-pleiotropic phenotype with 
sensitivities to a dozen or more compounds. Therefore, chemotyping of unknown genes with a large spectrum of 
drugs opens new vistas for a more in-depth functional analysis and a more precise definition of molecular targets. 
Copyright © 1999 John Wiley & Sons, Ltd. 

key words — drug-sensitivity/resistance; functional analysis; genome; robotics; Saccharomyces cerevisiae 



INTRODUCTION 

Currently, less than half of the genes of the Sac- 
charomyces cerevisiae genome have been character- 
ized, either genetically and/or biochemically. The 
majority of the remaining genes have no significant 
similarity to genes of known function. In addition, 
deletion of these genes often does not lead to 
obvious growth defects or other detectable pheno- 
typic characteristics, as tested by growth on 
standard media. The search for phenotypes by 
screening yeast mutants against a standardized 
compound collection is a logical step to increase 
our knowledge of a particular gene and its 
function. 

Recently we reported a microtiterplate-based 
screening for phenotypes of functionally uncharacterized 
genes from yeast chromosome III (Rieger 
et al., 1997, 1998). In the framework of the EUROFAN 
programme, we have now extended our 
previous approach in two ways: first, we have used 
an enlarged panel of inhibitors and growth conditions; 
and second, we have adapted the assay for 
high throughput by using a robotic workstation. 
This approach has been used in the framework 
of the consortium B1 to analyse the first 100 
deletants furnished by EUROSCARF (European 
Saccharomyces Cerevisiae Archives for Functional 
Analysis; deletants 10 001-10 096). In addition to 
various stress conditions and inhibitors listed in 
the preceding publication, we have analysed the 
inhibitory activity of more than 300 new chemicals 
on yeast, which in many cases have never been 
tested on S. cerevisiae (Rieger et al., 1997, 1998). 

The method used gives a large number of phe- 
notypes, which will be referred to henceforth as 
'chemotypes'. We have found that over 50% of the 
deletants listed in the MIPS/EUROSCARF data- 
base as having no detectable phenotype, do have 
significantly altered growth characteristics when 
exposed to one or more of the compounds/ 
conditions in our standard set. 
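
The two chemotype classes named in the abstract can be expressed as a simple rule on the number of compounds to which a deletant shows an altered response. The sketch below is illustrative only: the function name is hypothetical, and the "intermediate" label for 3 to 11 compounds is an assumption, since the text defines only the monotropic class (one or two compounds) and the hyper-pleiotropic class (a dozen or more).

```python
def classify_chemotype(n_compounds_with_phenotype):
    """Assign a chemotype class from the number of compounds giving a phenotype."""
    n = n_compounds_with_phenotype
    if n < 0:
        raise ValueError("count must be non-negative")
    if n == 0:
        return "no phenotype"
    if n <= 2:
        return "monotropic"
    if n >= 12:
        return "hyper-pleiotropic"
    return "intermediate"  # assumed label; not defined in the text


for count in (0, 1, 2, 5, 12, 20):
    print(count, classify_chemotype(count))
```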



MATERIALS AND METHODS 



Yeast strains 



All mutant strains generated during the EURO- 
FAN project were provided by EUROSCARF 
stock center (Frankfurt, Germany; for address and 
persons to contact, see Acknowledgements). All 
targeted gene deletions were carried out in the 
strain FY1679 (deletants FY 10001 A/B-10096 
A/B). For the primary phenotype screening, only 
mutants with the alpha mating type were used (FY 
10001 B-10096 B). As reference strain we utilized 
FY1679-18B (Baudin et al., 1993; Rieger et al., 
1997). 'Hits' (sensitivity or resistance to a given 
compound) were screened a second time against an 
independent set of the same mutants to eliminate 
false positives. As a further check, we tested ran- 
dom 'hits' discovered by our approach by showing 
co-segregation of the chemotype with the deletion 
marker (tetrad analysis) and complementation by 
the corresponding wild-type gene. 
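
The re-screening step described above (keeping only 'hits' that reappear in an independent set of the same mutants) can be sketched as a simple set intersection. This is a minimal illustration, not the authors' procedure: the function name, the tuple layout (strain, condition number, call) and the fourth entry in the primary screen are assumptions made up for the example.

```python
def confirmed_hits(primary, repeat):
    """Return hits observed in both independent screens, sorted for stable output.

    Each hit is a (strain, condition_number, call) tuple, where call is
    'HS', 'S' or 'R'. The data layout is an assumption for illustration.
    """
    return sorted(set(primary) & set(repeat))


# Primary screen: three chemotypes reported in the text plus one
# hypothetical false positive (YNL058c / condition 484), invented here.
primary = [
    ("YGR224w", 362, "HS"),
    ("YBR043c", 23, "HS"),
    ("YNL058c", 148, "S"),
    ("YNL058c", 484, "S"),
]
# Independent repeat screen: the invented hit does not reproduce.
repeat = [
    ("YGR224w", 362, "HS"),
    ("YBR043c", 23, "HS"),
    ("YNL058c", 148, "S"),
]

for hit in confirmed_hits(primary, repeat):
    print(hit)
```

Only the three reproducible hits survive the intersection. Tetrad analysis and complementation, the further checks mentioned above, are wet-lab steps and are not modelled here.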

Analysed strains were as follows: YBL047c/
YBR008c, YBR014c, YBR016w, YBR041w,
YBR042c, YBR043c, YBR161w, YBR162c,
YBR175w, YBR182c, YBR180w, YBR260c,
YBR264c//YDL074c, YDL117w, YDL223c,
YDL237w, YDL238c, YDL239c//YGL186c/
YGR136w, YGR221c, YGR223c, YGR224w,
YGR225w, YGR226c, YGR231c, YGR232w//
YJL038c, YJL045w, YJL046w, YJL047c,
YJL048c, YJL049w, YJL051w, YJL055w,
YJL056c, YJL057c, YJL058c, YJL059w,
YJL062w, YJL065c, YJL169w, YJL199c,
YJL201w, YJL204c, YJL206c, YJL207c,
YJL213w//YLR036c, YLR042c//YNL054w,
YNL056w, YNL058c, YNL063w, YNL065w,
YNL066w, YNL196c, YNL200c, YNL206c,
YNL208w, YNL211c, YNL212w, YNL213c,
YNL214w, YNL215w, YNL217w, YNL218w,
YNL281w, YNL283c, YNL285w, YNL286w,
YNL288w//YOL018c, YOL087c, YOL088c,
YOL091w, YOL098c, YOL101c, YOL104c,
YOL105c, YOL132w/YOR311c, YOR322c.

Chemicals 

Stock solutions of the different compounds were 
made mainly in dimethylsulphoxide (DMSO), 
NaOH and water (some in ethanol, methanol). 
Stock solutions were filter-sterilized and stored 
following the instructions of the suppliers (Alexis, 
Calbiochem, Research Biochemicals International, 
Sigma, Tocris Cookson). All of the chemicals
were of the highest available purity grade. Final 
concentrations are given in brackets: 

002, ethylenediamine-tetraacetic acid (EDTA)
[0.7/0.6 mM]; 004, tungstic acid [35/27.5 mM]; 006,
CdCl2 [55/50 µM]; 007, CsCl [100/90 mM]; 008,
CoCl2 [0.6/0.5 mM]; 009, CuSO4 [9/8.75/6.75 mM];
010, NiCl2 [1.25/1.15 mM]; 017, MgCl2 [490 mM];
020, RbCl [500/300 mM]; 021, SrCl2 [250 mM]; 022,
LiCl [160/150 mM]; 023, MnCl2 [3.5/3 mM]; 024,
ZnCl2 [3.5/3 mM]; 032, nalidixic acid [0.6/0.5 mg/ml];
040, sodium-o-vanadate [2.6/2.3 mM in 50 mM
KOH; tested with 704 as growth medium]; 044, 2,2
dipyridyl [10/9/7.5 µg/ml]; 050, cycloheximide
[0.18/0.16/0.14 µg/ml]; 054, benomyl [25/20 µg/ml;
tested on Petri dishes]; 056, 1,10 phenanthroline
[15/12.5 µg/ml]; 070, fluorescent brightener 28
[2/1.5/1 mg/ml; tested on Petri dishes]; 073,
p-chloromercuribenzoic acid (PCMB) [0.24/0.22/
0.21 mM]; 078, diltiazem-HCl [70/65/60 µg/ml];
092, benzamidine [33/31 mM]; 098, chlorpromazine
[60/50/40/30 µM in DMSO]; 100, quercetin [5/3 mM,
in ethanol]; 146, valeryl salicylate [2.75/2.5 mM in
DMSO]; 148, loperamide [1.4/1.2/1 mM in DMSO];
150, 6-dimethylaminopurine [7 mM in DMSO];
156, 2,4 dinitrophenol [1.3/1.1 mM in DMSO]; 160,
sanguinarine [0.15/0.13 mM in DMSO]; 170,
trimethoprim [3.5 mM in DMSO]; 180, tetrindole
mesylate [9/8.5/8/7 µM in DMSO]; 222, nifedipine
[2.25 mM in DMSO]; 224, nordihydroguaiaretic
Figure 1. Chemotyping of S. cerevisiae deletants in microtitre plates. Shown are representative examples of mutants provided by
the EUROFAN project. Preparation of medium, inhibitors, cell suspensions and plate filling were done as outlined in Materials
and Methods. Each microtitre plate was seeded with one deletant and analysed in 60 different growth conditions (corresponding
to 31 inhibitors at one or several different concentrations in the examples shown above) per plate. The outer rows of wells were
filled up with agar but were not used for chemotyping. The bottom line in each plate (2H-11H) corresponds to 10 replicas of the
deletant in the absence of inhibitor ('internal control'). Microtitre plates with control or deleted strains (all deletions were made in
a FY1679 background and all strains were MATα) were incubated for various periods of time at 28°C. Mutant strains were as
follows: (A) YNL215w; (B) YOR322c; (C) YJL204c. Panel (D) refers to the position of the corresponding inhibitors in each
microtitre plate (compounds that are not listed in Materials and Methods: zaprinast [120], t-butylhydroquinone [124], disulphiram
[136], mycophenolic acid [162], N1,N12-diethylspermine tetrahydrochloride [320], 2,5 anhydro-D-mannitol [330], N-ethylmaleimide
[344], benserazide [400], 1-O-hexyl-rac-glycerol [464]). Photographs were taken after 2 days of incubation at 28°C.



acid (NDGA) [0.8 mM in DMSO]; 246, canthardate
[0.5 mM in DMSO]; 276, ruthenium red [2.5 mM
in DMSO]; 282, phenyl-ethylamine [0.40/0.035/
0.03% in DMSO]; 310, thiabendazole [100/80 µg/ml
in DMSO; tested on Petri dishes]; 334,
doxorubicin [0.1/0.08 mM]; 348, diethyl maleate
[0.15/0.14/0.13% in ethanol]; 356, methyl caffeate
[2.75/2.5 mM in DMSO]; 362, polymyxin B [0.11/
0.1 mM]; 378, latrunculin B [0.03/0.0275 mM in
DMSO]; 390, 4-aminopyridine [6/5 mM in DMSO];
418, β-chloro-L-alanine [9 mM in DMSO]; 436,
azaserine [0.28/0.26/0.24 mM]; 438, BAPTA-
tetrapotassium salt [1.25/1.0 mM]; 444, guanosine
5'-O-(2-thiodiphosphate) [0.7 mM in DMSO]; 450,
diphenyleneiodonium [0.15/0.13 mM in DMSO];
454, hexadecylphosphocholine [2/1.5/1 µM in
DMSO]; 474, compound 48/80 [0.7/0.6 mM]; 476,
hydroxyurea [10 mg/ml]; 484, caffeine [14/13/
10 mM]; 486, tunicamycine [0.14/0.12% in DMSO];
578, daunorubicin [0.1265 mM in DMSO]; 592,
cerulenin [2.25/2/1.75 µM in DMSO]; 598, isopropyl
(N-3-chlorophenyl)-carbamate (IPCPC) [0.4/
0.2 mM in DMSO]; 701, tetraethylammonium
chloride [55/40/20 µg/ml].
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Table 1. Selective list of chemicals and growth conditions for the phenotypic analysis of genes of unknown 
function from EUROSCARF (for strains, see Materials and Methods). Numbers refer to the preparation of the 
corresponding media as outlined in Materials and Methods. Literature quotations are non-exhaustive — for 
simplicity we often refer to the information provided by the supplier (Alexis, 1997) — and indicated only for 
substances that have not been described earlier (Rieger et al, 1997). 



Compound: Function and/or target and mode of function of inhibitors

004 Tungstic acid: Potent inhibitor of xanthine oxidase, inhibits sulfite oxidase in rats (Johnson et al., 1974); competitive antagonist of molybdenum in animals; phosphotyrosine phosphatase (PTP) inhibitor (mouse); inhibitor of acid phosphatase (human)

092 Benzamidine: Peptidase inhibitor

098 Chlorpromazine: Inhibitor of calmodulin stimulation of cyclic nucleotide phosphodiesterase; dopamine antagonist, antiemetic and antipsychotic; also acts as peripheral vasodilator; inhibits TNF-α production; potent PLA2 inhibitor, inhibits nitric oxide synthase in mouse brain and prevents lipopolysaccharide induction of nitric oxide synthase in murine lung (Alexis, 1997); counteracts alpha factor-induced growth arrest in G1 of the S. cerevisiae cell cycle (Ruiz and Rodriguez, 1986); acts as photomutagen (Chetelat et al., 1993)

100 Quercetin: Antioxidant flavonoid; inhibitor of mitochondrial ATPase, cAMP- and cGMP phosphodiesterases; inhibitor of protein tyrosine kinases and protein kinase C; induces apoptosis in K562, Molt-4, Raji and MCAS tumour cell lines (Alexis, 1997)

146 Valeryl salicylate: Selective inhibitor of COX-1 (Alexis, 1997)

148 Loperamide: Ca2+-channel antagonist; binds to opioid receptors; has antidiarrhoeal activity at low concentrations; binds to calmodulin at high concentrations; meperidine analogue (Alexis, 1997)

150 6-Dimethylaminopurine: Non-selective inhibitor of cdc2 and other protein kinases (Alexis, 1997; Chong et al., 1995; Felix et al., 1989)

160 Sanguinarine: Na+/K+- and Mg2+-ATPase inhibitor (Alexis, 1997)

170 Trimethoprim: Inhibitor of dihydrofolate reductase (Bertani and Campbell, 1994)

180 Tetrindole mesylate: Novel antidepressant; selective monoamine oxidase A inhibitor

222 Nifedipine: Calcium channel blocker (L-type); vasodilator (Alexis, 1997)

224 Nordihydroguaiaretic acid: Antioxidant; lipoxygenase inhibitor (Alexis, 1997; Jensen et al., 1992; Kustimur et al., 1997)

246 Canthardate: High potency inhibitor of protein phosphatase 2A (Alexis, 1997)

276 Ruthenium red: Capsaicin and calcium antagonist; inhibitor of Ca2+/Mg2+-ATPase (Alexis, 1997); channel blocker (Calvert and Sanders, 1995); affects wall morphogenesis in yeasts (Poli et al., 1995)

282 Phenylethylamine: Potentiates dopamine and noradrenaline function in the central nervous system (Yu, 1994)

310 Thiabendazole: Microtubule depolymerizing drug (Castano et al., 1996)

334 Doxorubicin: Antitumour antibiotic; inhibitor of reverse transcriptase and RNA polymerase; immunosuppressive agent; highly effective myotoxin that inhibits topoisomerase II; binds to nucleic acids, presumably by specific intercalation into the DNA double helix, thereby inhibiting nucleic acid synthesis; induces apoptosis (Alexis, 1997; Kule et al., 1994; Patel et al., 1997)

348 Diethyl maleate: Generates oxidative stress (Kuge et al., 1997)

356 Methyl caffeate: Inhibitor of ornithine decarboxylase and protein tyrosine kinase (Alexis, 1997)

362 Polymyxin B: Inhibitor of protein kinase C; antibiotic; breaks bacterial membranes by incorporating into the phospholipid of the outer membrane and activating phospholipase (Alexis, 1997; Boguslawski, 1992)
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Compound 



Function and/or target and mode of function of inhibitors 



378 Latrunculin B: Structurally unique marine toxin; actin filament modulator; 10-100-fold more potent than cytochalasins; whereas cytochalasin D induces dissolution of F-actin and stress fibre contraction in fibroblasts in culture, latrunculin B causes a shortening and thickening of stress fibres. These differences may indicate that the two classes of compounds have distinct target sites (Alexis, 1997)

390 4-Aminopyridine: Induces depolarization of GABA neurons; prolongs action potential in demyelinated nerve fibres (Alexis, 1997)

418 β-Chloro-L-alanine: Inhibitor of alanine aminotransferase (Morino et al., 1979)

436 Azaserine: Potent inhibitor of purine biosynthesis; glutamine antagonist (Aronow et al., 1986; Becker and Kim, 1987)

438 BAPTA: Highly selective calcium chelating agent (Alexis, 1997; Loukin and Kung, 1995)

444 Guanosine 5'-O-(2-thiodiphosphate): Non-hydrolyzable GDP-analogue that competitively inhibits G-protein activation by GTP and GTP analogues (Alexis, 1997)

450 Diphenyleneiodonium: Binds strongly to flavoproteins and is thus a powerful and specific inhibitor of several important enzymes, including nitric oxide synthase, NADH-ubiquinone oxidoreductase, NADPH oxidases and NADPH cytochrome P450 oxidoreductase; recently, nitric oxide synthase, which shows significant homology with cytochrome P450 reductase, has been shown to be irreversibly inhibited by this compound (Alexis, 1997; Lesuisse et al., 1996)

454 Hexadecylphosphocholine: Potent antitumour agent; inhibits protein kinase C and phosphatidylcholine biosynthesis; co-stimulates human T cell activation (Alexis, 1997)

474 Compound 48/80: Blocks calmodulin and human platelet phospholipase C; blocks ADP ribosylation; activates G-proteins (likely G x proteins) in a manner similar to G protein-coupled receptors; modulates human platelet phospholipase A2 in a concentration-dependent, biphasic manner; stimulates release of histamine and arachidonic acid (Alexis, 1997)

578 Daunorubicin: Potent anticancer agent whose potential target site may be mitochondrial cytochrome C oxidase; inhibits RNA and DNA synthesis; also inhibits eukaryotic topoisomerase I and II; induces DNA single-strand breaks and apoptosis in HeLaS3 tumor cells (Alexis, 1997; Rule et al., 1994)

598 Isopropyl (N-3-chlorophenyl)-carbamate: Mitotic poison, inhibits plant metabolism



Non-standard culture conditions 

702, slow-growth (three growth temperatures:
16, 28, 36°C; tested on YPD); 703, pH 3.41/3.8/
4.16 (concentrated YPD [90% of final volume] was
mixed at about 60°C with filter-sterilized 10×
acetate buffer (1 M) of pH as indicated); 704, N3
(standard glycerol medium: 1% yeast extract, 1%
bactopeptone, 2% glycerol, 0.05 M sodium phosphate
(pH 6.2, 100 ml/l) and 80 mg/l adenine); 705,
pH 7.8/8 (concentrated YPD [90% of final volume]
was mixed at about 60°C with filter-sterilized 10×
phosphate buffer [K2HPO4/KH2PO4, 1 M] of pH as
indicated).

Establishing the range of inhibitor concentrations 
of the reference strain 

To establish the threshold concentration (or a
range of concentrations) for the reference strain,
compounds were serially (×4) diluted in 20%
DMSO in microtitre plates. Where necessary, the
final inhibitor concentration was determined more
precisely by testing concentrations between the
last dilution (e.g. ×4) that completely inhibited
growth and the next serial dilution (e.g. ×16)
allowing growth. In all cases plates were filled with
compounds (20 µl/well) followed by distribution
of YPD medium (180 µl/well, 0.7% agar). Plates
were stored for at least 24 h at 4°C to allow
chemicals to diffuse into the agar prior to dispensing
the control cells (FY1679-18B). The
reference strain was cultured overnight in liquid
YPD at 28°C, dilutions were made in YPD and
about 1000 cells/well (20 µl) were spotted in the
wells. Cell growth was then followed for up to
7 days at 28°C in the presence or absence of the
compounds.
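
The arithmetic behind the threshold search can be sketched as follows. This is an illustrative calculation only: the function names are hypothetical, the even spacing of intermediate concentrations is an assumption (the text says only that concentrations between two dilution steps were tested), and the 20 µl of cell suspension spotted on the solidified agar is ignored when computing the final concentration.

```python
def dilution_series(stock, factor=4, steps=6):
    """Concentrations of a serial dilution (x4 by default), highest first."""
    return [stock / factor ** i for i in range(steps)]


def final_in_well(conc, compound_ul=20.0, medium_ul=180.0):
    """Concentration after 20 ul of compound is topped up with 180 ul of medium."""
    return conc * compound_ul / (compound_ul + medium_ul)


def refine(inhibiting, permissive, n=3):
    """Evenly spaced concentrations strictly between two dilution steps.

    Even spacing is an assumption made for this sketch.
    """
    step = (inhibiting - permissive) / (n + 1)
    return [permissive + step * i for i in range(1, n + 1)]


# A x4 series starting from an arbitrary stock of 64 (any unit).
print(dilution_series(64, factor=4, steps=4))

# Compounds diluted in 20% DMSO end up at a tenth of that in the well.
print(final_in_well(20.0))

# Intermediate points between an inhibiting and a permissive dilution.
print(refine(16, 4, n=3))
```

Under these assumptions each well receives a tenfold dilution of whatever was pipetted, so the solvent and compound concentrations quoted for the wells are one tenth of those in the dilution plate.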

Media, replication of compound arrays by a 
robotic workstation and cell culture conditions 

The majority of assays were done with standard
complete glucose medium (YPD, 1% yeast extract
[Difco Laboratories, Detroit, USA], 1% bactopeptone
[Difco], 2% glucose, 20 mg/l adenine; growth
medium was solidified by adding 0.7% Bacto-Agar
[Difco]). In some cases chemotypes were scored on
Petri dishes (e.g. growth conditions 54, 310, 703,
705). In this case, medium was solidified by adding
2% Bacto-Agar.

Compound arrays were replicated at 20 µl per
well into sterile 96-well microtitre plates (NUNC,
Intermed, Polylabo, Switzerland) from larger volumes
held in 3 ml deep-(96)-well 'master' plates
(Qiagen). This operation was carried out using a
Genesis RSP100 pipetting robot (TECAN AG,
Switzerland). Sterile 96-well microtitre plates and
'master' plates were stored with loose lids on a
high capacity storage carousel. Plates were transferred
between the carousel and the pipettor
using a CRS A255 robotic arm (CRS, Canada).
Movements were programmed using CLARA
dynamic scheduling software (SCITEC, Lausanne,
Switzerland). Each of these 'master' plates contained
sufficient volumes of compound solutions to
produce up to 54 replicated plates. The peripheral
wells were left empty, to avoid previously observed
'edge' effects. Hot growth medium (YPD, containing
0.7% agar) was subsequently dispensed at
180 µl per well, into all 96 wells of each of the
replicated compound plates using a Multidrop-831
eight-channel dispenser (Labsystems OY,
Finland). This was carried out as a manual batch
operation, using a custom-modified delivery tube
assembly to deliver agar at approximately 50-65°C.
Tubing was sterilized with ethanol and
rinsed with sterile deionized water immediately
before priming with the sterile agar solution. Plates
were then loose-lidded and allowed to cool at 4°C
for at least 1 day in a horizontal position before
manually dispensing yeast cell suspensions using a
handheld semiautomatic multichannel pipette
EDP-Plus M8 (Rainin Instrument Company, Inc.,
Woburn, MA).
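
The plate geometry implied by this section and by the Figure 1 caption (60 inhibitor conditions per plate, 10 inhibitor-free replicas in positions 2H-11H, remaining peripheral wells unused) can be sketched as a well map. The exact assignment of wells below is an inference from those numbers rather than a layout given in the text, and the function names are hypothetical.

```python
ROWS = "ABCDEFGH"
COLUMNS = range(1, 13)


def plate_layout():
    """Map each well of a 96-well plate to 'test', 'control' or 'unused'.

    Assumed layout: rows B-G x columns 2-11 carry inhibitors (60 wells),
    H2-H11 are inhibitor-free controls, all other wells are unused.
    """
    layout = {}
    for row in ROWS:
        for col in COLUMNS:
            inner_column = 2 <= col <= 11
            if row in "BCDEFG" and inner_column:
                role = "test"
            elif row == "H" and inner_column:
                role = "control"
            else:
                role = "unused"
            layout[f"{row}{col}"] = role
    return layout


def master_volume_ul(n_plates, per_well_ul=20):
    """Compound volume drawn from one master-plate well for n replica plates."""
    return n_plates * per_well_ul


layout = plate_layout()
counts = {role: list(layout.values()).count(role) for role in ("test", "control", "unused")}
print(counts)
print(master_volume_ul(54))
```

With these assumptions the map gives 60 test wells, 10 control wells and 26 unused wells, and 54 replica plates draw 1080 µl from each 3 ml master well.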

Strains were cultured overnight in liquid YPD at
28°C. On the next day, the optical density of all
cultures was determined and cell density adjusted
to 5 × 10^4 cells/ml by dilution into fresh YPD.
For the seeding of cells in microtitre plates, cell
suspensions were vortexed vigorously and filled at
about 1 ml in the tubes of a cluster tube 8-strip
rack (Costar, Polylabo, Paris). 20 µl of the cell
suspension, corresponding to about 1000 cells/well,
were inoculated into the wells and the microtitre
plate placed on a shaker for 5 s in order to cover
the agar surface uniformly. Plates were then incubated
at 28°C for up to 10 days. From the first day
of incubation on, growth of the mutant strains was
scored visually, either directly on the plate or later
on photographs, by comparison with the growth
of the corresponding control strains.
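
The inoculum arithmetic above can be checked with a short calculation. The sketch assumes a conversion factor between optical density and cell number, which the text does not give; `cells_per_ml_per_od` is a hypothetical parameter that would have to be calibrated for the strain and photometer in use.

```python
TARGET_CELLS_PER_ML = 5e4  # density stated in the text


def dilution_factor(od600, cells_per_ml_per_od=3e7):
    """Fold dilution needed to bring an overnight culture to the target density.

    cells_per_ml_per_od is an assumed calibration constant, not a value
    taken from the source.
    """
    return od600 * cells_per_ml_per_od / TARGET_CELLS_PER_ML


def cells_per_well(density_per_ml=TARGET_CELLS_PER_ML, volume_ul=20.0):
    """Number of cells delivered by one inoculation volume."""
    return density_per_ml * volume_ul / 1000.0


# 20 ul of a 5 x 10^4 cells/ml suspension per well.
print(cells_per_well())

# Example: a culture at OD600 = 2.0 with an assumed 1e7 cells/ml per OD unit.
print(dilution_factor(2.0, cells_per_ml_per_od=1e7))
```

The first value reproduces the figure of about 1000 cells per well quoted in the text; the second depends entirely on the assumed calibration constant.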



RESULTS AND DISCUSSION 

In the present paper we describe the systematic
phenotypic analysis of 85 yeast mutants using a
robotic workstation in combination with a battery
of some 300 growth inhibitory conditions.
Mutants as a starting material for functional
analysis were generated during the EUROFAN 1
programme and have been obtained from EUROSCARF.
These mutants, which had no obvious
phenotypes (as tested by growth on standard
media: YPD, respiratory medium, different temperatures),
were therefore analysed by screening
them against a collection of chemical compounds.
This library contained, besides the growth conditions
described earlier (Rieger et al., 1997), a
large number of 'novel' inhibitors that have not
been reported previously in the literature to interfere
with cell growth of Saccharomyces cerevisiae.
Some representative examples of the screening are
given in Figure 1. Among the compounds, only
those that gave a clear and stringent phenotype
with the analysed set of mutants are presented
here (Table 1). Many novel inhibitors (e.g. isopropyl
(N-3-chlorophenyl)-carbamate, loperamide,
sanguinarine, chlorpromazine, ruthenium red,
Figure 2. Distribution of the number of deletants as a function of the number of
chemotypes (x-axis: number of chemicals/conditions). For each deletant the number of chemotypes was determined (see Table 2):
e.g. YGR224w had one chemotype (hypersensitivity to polymyxin B) while YOL088c
had three chemotypes (hypersensitivity to CdCl2 and sensitivity to cerulenin and
loperamide). The deletants were grouped into classes displaying 1 or 2 chemotypes, 3 or
4 chemotypes, etc. and the number of deletants in each class is shown.



azaserine, guanosine 5'-O-(2-thiodiphosphate),
etc.) turned out to be highly discriminating, since in 
many instances specific deletants were much more 
sensitive to the drug than the reference strain. 
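
The grouping used for Figure 2 (classes of 1 or 2 chemotypes, 3 or 4 chemotypes, and so on) can be sketched as a simple binning step. The code is an illustration, not the authors' analysis: the function names are hypothetical, and the input is a small subset of deletants whose chemotype counts are stated in the text or, for YJL204c, counted from its Table 2 listing.

```python
from collections import Counter


def size_class(n_chemotypes, width=2):
    """Return the (low, high) class containing n_chemotypes, e.g. 3 -> (3, 4)."""
    low = ((n_chemotypes - 1) // width) * width + 1
    return (low, low + width - 1)


def distribution(chemotype_counts):
    """Count deletants per class; deletants without a chemotype are skipped."""
    return Counter(size_class(n) for n in chemotype_counts.values() if n > 0)


# Illustrative subset only, not the full data set of the study.
chemotype_counts = {
    "YGR224w": 1,   # hypersensitive to polymyxin B
    "YBR043c": 1,   # hypersensitive to MnCl2
    "YNL058c": 1,   # sensitive to loperamide
    "YOL088c": 3,   # CdCl2, cerulenin, loperamide
    "YJL204c": 23,  # entries counted from the Table 2 listing
}

for bounds, n_deletants in sorted(distribution(chemotype_counts).items()):
    print(bounds, n_deletants)
```

For this subset the classes (1, 2), (3, 4) and (23, 24) contain three, one and one deletants respectively; the full figure would require the complete list of chemotypes from Table 2.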

Interestingly, as shown in Figure 2, the deletants
can be grouped in two classes. In the first class, the
majority of the mutants displayed sensitivity to a
small number of inhibitors/chemicals: most of
them were sensitive to one or two inhibitors tested
(e.g. YOR311c, YGR224w, YBR043c; see Table 2)
while progressively fewer mutants were sensitive to
a larger number of compounds tested. As Table 2
shows, the chemicals to which those mono-/bitropic
deletants are sensitive are characteristic
and specific for each mutant (e.g. YBR043c is
hypersensitive to MnCl2, while YNL058c is sensitive
to loperamide). It is possible that this class
comprises genes which are involved in rather
specific cellular processes. In addition, their
deletion does not severely interfere with the 'general
health' of yeast. On the other hand, the second
class, which is clearly distinct from the first, displays
highly pleiotropic phenotypes, being sensitive
to at least 11 different inhibitors/chemicals
(in one extreme case, the mutant [YJL204c] was
sensitive to more than 20 inhibitors/chemicals). It
is plausible that the latter class comprises genes
involved in some major cellular processes which
hierarchically control several interconnected pathways
like signal transduction, regulation of transcription,
stress response or elaboration of cell
architecture. Such multiple effects are understandable,
since the primary defect can lead by a cascade
of successive interactions to a variety of end effects
which are indicative of the importance of the
deleted gene. The panoply of end effects can be
considered as 'symptoms' of the mutation and they
can be grouped into 'syndromes' diagnostic of
the biochemical/physiological process which is
affected. This kind of analysis is beyond the scope
of this article, which is essentially methodological,
and focuses simply on the discovery of novel and
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Table 2. List of chemotypes of gene deletions provided by EUROSCARF (Nos 10 001-10 096). 



ORF/gene name: comments (MIPS/SGD/YPD)
Phenotypes (this study): hypersensitivity [HS]; sensitivity [S]; resistance [R]

YOL088c/MPD2
  Comments: Protein of unknown function; weak similarity to disulphide isomerases and ER60 proteases; has ER-targeting sequence (Zumstein et al., 1995); overproduction suppresses lethality of protein disulphide isomerase depletion (Tachikawa et al., 1997)
  Phenotypes: Cadmium chloride [HS] (6), cerulenin [S] (592), loperamide [S] (148)

YOL087c
  Comments: Protein of unknown function; has one WD (WD-40) domain; similarity to S. pombe hypothetical protein
  Phenotypes: 1,10 phenanthroline [HS] (56), sanguinarine [HS] (160), NDGA [HS] (224), IPCPC [S] (598), guanosine 5'-O-(2-thiodiphosphate) [S] (444)

YOR322c
  Comments: Protein of unknown function; null mutant shows osmotic sensitivity on YEPD at 37°C (Pearson et al., 1998); similarity to hypothetical S. pombe protein SPAC1F2.05
  Phenotypes: EDTA [S] (2), cadmium chloride [HS] (6), cycloheximide [HS] (50), 6-dimethylaminopurine [HS] (150), methyl caffeate [HS] (356), β-chloro-L-alanine [S] (418), caffeine [HS] (484), cerulenin [S] (592)

YOR311c
  Comments: Protein of unknown function; similarity to S. pombe hypothetical protein
  Phenotypes: Nickel chloride [R] (10), nalidixic acid [R] (32)

YGR224w
  Comments: Strong similarity to drug resistance protein SGE1; belongs to cluster II of the MFS-MDR family (Goffeau et al., 1997)
  Phenotypes: Polymyxin B [HS] (362)

YBR043c
  Comments: Similarity to benomyl/methotrexate resistance protein; belongs to cluster I of the MFS-MDR family (Goffeau et al., 1997)
  Phenotypes: Manganese chloride [HS] (23)

YNL066w/SUN4
  Comments: Involved in the ageing process [mutants have longer lifespan and better viability upon starvation (Austriaco, 1996)]; strong similarity to YIL123w, Uth1p, Nca3p and C. wickerhamii β-glucosidase protein
  Phenotypes: Polymyxin B [S] (362), azaserine [S] (436)

YNL063w
  Comments: Protein of unknown function; weak similarity to Mycoplasma protoporphyrinogen oxidase
  Phenotypes: Canthardate [S] (246), polymyxin B [S] (362), tetraethylammonium chloride [S] (701)

YNL058c
  Comments: Protein of unknown function; similarity to YlLUlc
  Phenotypes: Loperamide [S] (148)

YNL056w
  Comments: Protein of unknown function; similarity to YNL099c and SIW14 (protein tyrosine phosphatase; null mutant fails to show cell cycle arrest upon nutrient starvation, is sensitive to 5 mM caffeine and 1 M NaCl and shows delocalized actin upon nutrient starvation)
  Phenotypes: Diltiazem-HCl [S] (78), caffeine [HS] (484)

YNL054w/VAC7
  Comments: Integral vacuolar membrane protein, involved in vacuole morphology and inheritance (Bonangelino et al., 1997)
  Phenotypes: EDTA [S] (2), copper sulphate [S] (9), cycloheximide [HS] (50), benomyl [HS] (54), fluorescent brightener [S] (70), azaserine [HS] (436), caffeine [HS] (484), slow-growth (702)
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Table 2. Continued. 



ORF/gene name: comments (MIPS/SGD/YPD)
Phenotypes (this study): hypersensitivity [HS]; sensitivity [S]; resistance [R]

YBR260c
  Comments: Protein of unknown function; similarity to C. elegans GTPase-activating protein, null mutant has slightly decreased viability during stationary phase; interacts genetically with SLG1 (plasma membrane protein required for maintenance of cell wall integrity and for the stress response)
  Phenotypes: Cadmium chloride [HS] (6), diltiazem-HCl [S] (78), tunicamycine [HS] (486)

YNL215w
  Comments: Protein of unknown function; similarity to hypothetical S. pombe protein
  Phenotypes: Cycloheximide [HS] (50), quercetin [S] (100), azaserine [S] (436), hydroxyurea [S] (476), IPCPC [S] (598), slow-growth (702)

YNL214w/PEX17
  Comments: Component of the peroxisomal protein translocation-peroxin; null mutants lack morphologically detectable peroxisomes (Huhse et al., 1998)
  Phenotypes: Doxorubicin [S] (334), compound 48/80 [S] (474), daunorubicin [S] (578)

YBR264c/YPT10
  Comments: Protein of unknown function; protein with similarity to rab proteins and other small GTP-binding proteins (Du et al., 1998)
  Phenotypes: Chlorpromazine [S] (98), tetraethylammonium chloride [S] (701), pH 3.41/4.16 [S] (703)

YNL213c
  Comments: Protein of unknown function; lysine-rich
  Phenotypes: Extreme slow growth according to EUROSCARF [HS] (702), pH 3.8 [S] (703)

YJL207c
  Comments: Protein of unknown function; weak similarity to rat omega-conotoxin-sensitive calcium channel alpha-1 subunit rbB-I; weak similarity to Spc98p
  Phenotypes: Diltiazem-HCl [S] (78), chlorpromazine [HS] (98)

YJL204c
  Comments: Protein of unknown function; weak similarity to TOR2p but lacks consensus sequence for lipid kinases; member of a family of glycosyl hydrolases
  Phenotypes: Cesium chloride [HS] (7), cobalt chloride [HS] (8), rubidium chloride [S] (20), zinc chloride [HS] (24), cycloheximide [HS] (50), PCMB [HS] (73), diltiazem-HCl [HS] (78), chlorpromazine [HS] (98), quercetin [S] (100), loperamide [HS] (148), trimethoprim [HS] (170), tetrindole mesylate [S] (180), nifedipine [S] (222), NDGA [S] (224), ruthenium red [HS] (276), latrunculin B [S] (378), polymyxin B [HS] (362), β-chloro-L-alanine [S] (418), guanosine 5'-O-(2-thiodiphosphate) [S] (444), diphenyleneiodonium [S] (450), slow-growth [S] (702), pH 4.16-sensitive [S] (703), pH 7.8/8-sensitive [HS] (705)

YOL018c/TLG2
  Comments: Member of the syntaxin family of t-SNAREs; mutants show endocytosis defect and loss of Kex2p; affects late Golgi compartment (Holthuis et al., 1998)
  Phenotypes: EDTA [HS] (2), cesium chloride [HS] (7), rubidium chloride [HS] (20), strontium chloride [HS] (21), lithium chloride [S] (22), cycloheximide [HS] (50), diltiazem-HCl [HS] (78), chlorpromazine [HS] (98), loperamide [HS] (148), NDGA [HS] (224), ruthenium red [HS] (276), latrunculin B [S] (378), BAPTA [S] (438), caffeine [HS] (484)

YNL196c/SLZ1
  Comments: Sporulation specific protein, regulated by the transcription factor Ume6 and expressed early in meiosis
  Phenotypes: Loperamide [HS] (148)
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Table 2. Continued, 



ORF/gene name: comments
(MIPS/SGD/YPD)



Phenotypes (this study):
hypersensitivity [HS]; sensitivity [S];
resistance [R]



YOL105c/WSC3 



YNL200c 



YNL206c 



YNL217w



YGR226c 

YGR221c 

YNL211c
YGR231c/PHB2 



YDL117w 

YGR136w 
YBL047c 



YDL074c 



Cell wall integrity and stress response
component 3; slg1 (wsc1)-null wsc3-null
double mutant shows a lysis defect on
YPD at room temperature and heat shock
sensitivity (Verna et al., 1997)
Protein of unknown function; strong
similarity to human TGR-CL10C
(thyroidal receptor for
N-acetylglucosamine); contains a possible
secretory signal; belongs to a group of 16
genes that are coordinately induced early
in the diauxic shift (De Risi et al., 1997)
Protein of unknown function; protein with
similarity to SSRP proteins (DNA
structure-specific recognition protein);
contains one PEST-site



Canthardate [S] (246) 



Protein of unknown function; weak 
similarity to E. coli bis (5'-nucleosyl)- 
tetraphosphatase; contains a prokaryotic 
membrane lipoprotein lipid attachment 
site 

Protein of unknown function 

Protein of unknown function; similarity to 
hypothetical protein YHR149c 
Protein of unknown function 
Protein required for normal life-span;
strong similarity to S. cerevisiae Phb1p,
mammalian prohibitins and mouse B-cell
receptor-associated protein BAP37 (Coates
et al., 1997)

Protein of unknown function; similarity to 
hypothetical S. pombe protein 

Protein of unknown function; has 
similarity to YPR154w 
Protein with similarity to cytoskeletal
protein Uso1p, Pan1p and mouse tyrosine
kinase substrate EPS15; contains EF-hand
calcium-binding domain and EH domain
(Tang and Cai, 1996; Wendland and Emr,
1998)

Protein of unknown function; weak 
similarity to spindle pole body protein 
NUF1 



Polymyxin B [HS] (362), azaserine [S] (436) 



Cesium chloride [S] (7), zinc chloride [S] (24),
sodium-o-vanadate [R] (40), cycloheximide
[HS] (50), benzamidine [S] (92), loperamide
[HS] (148), sanguinarine [HS] (160),
ruthenium red [S] (276), thiabendazole [HS]
(310), azaserine [S] (436), caffeine [HS] (484),
IPCPC [HS] (598), slow-growth [S] (702),
pH 4.16 [S] (703)

Azaserine [HS] (436), sodium-o-vanadate [R]
(40)



Cycloheximide [HS] (50), diltiazem-HCl [S]
(78), benzamidine [HS] (92), pH 4.16 [S] (703)
Guanosine 5'-O-(2-thiodiphosphate) [S] (444)

Sodium-O-vanadate [R] (40) 
Loperamide [S] (148) 



Magnesium chloride [S] (17), 1,10 
phenanthroline [S] (56), loperamide [HS] 
(148), azaserine [HS] (436) 
EDTA [S] (2) 

Manganese chloride [HS] (23), diltiazem-HCl 
[S] (78), caffeine [HS] (484) 



Cycloheximide [HS] (50), benomyl [R] (54), 
PCMB [S] (73), chlorpromazine [HS] (98), 
canthardate [S] (246), diphenyleneiodonium 
[S] (450) 
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Table 2. Continued. 



ORF/gene name: comments (MIPS/SGD/YPD)
Phenotypes (this study): hypersensitivity [HS]; sensitivity [S]; resistance [R]

YBR014c
  Comments: Protein of unknown function; similarity to glutaredoxin and strong similarity to YDL010w
  Phenotypes: Diltiazem-HCl [S] (78)

YBR162c
  Comments: Protein of unknown function; similarity to hypothetical protein YJL171c and Aga1p
  Phenotypes: EDTA [S] (2), copper sulphate [HS] (9)

YBR175w
  Comments: Protein of unknown function; has WD (WD-40) repeats; weak similarity to GTP-binding proteins, Tup1p, Pwp2p and human LIS-1
  Phenotypes: Chlorpromazine [HS] (98), NDGA [S] (224), IPCPC [S] (598)

YJL051w
  Comments: Protein of unknown function
  Phenotypes: EDTA [S] (2), cadmium chloride [HS] (6), benzamidine [HS] (92), valeryl salicylate [S] (146), 2,4 dinitrophenol [S] (156), sanguinarine [S] (160), phenyl-ethylamine [S] (282), 4-aminopyridine [HS] (390), hydroxyurea [S] (476), slow-growth (702), pH 3.8 [S] (703), N3 [HS] (704)

YJL059w/BTN1/YHC3
  Comments: Protein of unknown function; similarity to human Batten disease-related protein CLN3, not required for mitochondrial function or degradation of mitochondrial ATP synthase (Pearce and Sherman, 1997)
  Phenotypes: Copper sulphate [HS] (9), loperamide [S] (148), extreme slow growth according to EUROSCARF [HS] (702)

YJL049w
  Comments: Protein of unknown function
  Phenotypes: EDTA [S] (2)

YJL058c
  Comments: Protein of unknown function; strong similarity to hypothetical protein YBR270c
  Phenotypes: Tungstic acid [HS] (4), copper sulphate [HS] (9), benomyl [HS] (54), diltiazem-HCl [S] (78), quercetin [S] (100), 6-dimethylaminopurine [S] (150), thiabendazole [HS] (310), polymyxin B [HS] (362), azaserine [S] (436), caffeine [HS] (484), pH 4.16 [HS] (703)

YJL056c/ZAP1
  Comments: Metalloregulatory protein involved in zinc-responsive transcriptional regulation; plays a central role in zinc ion homeostasis by regulating transcription of the zinc uptake system genes in response to zinc; null mutant grows poorly on zinc-limiting media (Zhao and Eide, 1997)
  Phenotypes: EDTA [HS] (2), diltiazem-HCl [S] (78), loperamide [HS] (148), polymyxin B [S] (362), BAPTA [S] (438), IPCPC [S] (598)

YJL057c/IKS1
  Comments: Probable serine/threonine kinase; null mutant is heat shock-sensitive; ira1 kinase suppressor (Matviw et al., 1993)
  Phenotypes: Copper sulphate [HS] (9)

YJL047c
  Comments: Protein of unknown function; weak similarity to cdc53p and similarity to clathrin heavy chain in one domain
  Phenotypes: Thiabendazole [HS] (310), azaserine [HS] (436), IPCPC [HS] (598)

YJL055w
  Comments: Protein of unknown function; similarity to R. fascians hypothetical protein 6; has similarity to P. aeruginosa hypothetical protein in azu region
  Phenotypes: Phenyl-ethylamine [R] (282), polymyxin B [R] (362)

YJL046w
  Comments: Protein of unknown function; similarity to E. coli lipoate-protein ligase A
  Phenotypes: Zinc chloride [R] (24), hexadecylphosphocholine [R] (454)

YJL045w
  Comments: Protein of unknown function; similarity to succinate dehydrogenase flavoprotein
  Phenotypes: Azaserine [HS] (436)
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ORF/gene name: comments 
(MIPS/SGD/YPD) 



Phenotypes (this study): 
hypersensitivity [HS]; sensitivity [S]; 
resistance [R] 



YJL065c
    Comments: Protein of unknown function; weak similarity to DNA-directed DNA polymerase II chain C
    Phenotypes: 2,2-Dipyridil [S] (44), diethyl maleate [R] (348)

YGL186c
    Comments: Protein of unknown function; member of purine/cytosine permease family, a subfamily of the major facilitator superfamily (MFS) (Nelissen et al., 1997); similarity to hypothetical protein Fcy21p and weak similarity to FCY2 protein
    Phenotypes: Chlorpromazine [S] (98), 4-aminopyridine [HS] (390), hydroxyurea [HS] (476), slow-growth [S] (702), pH 4-16 [HS] (703)

YOL091w
    Comments: Protein of unknown function; homozygous diploid null mutant fails to sporulate (Pearson et al., 1998)
    Phenotypes: Grows much better than the wild-type under many different growth conditions, for example: 9, 24, 44, 56, 148, 300, 362, 436, 454, 701, etc.

YBR042c
    Comments: Protein of unknown function; strong similarity to hypothetical protein YDR018c; probable ATPase with 4 potential transmembrane domains
    Phenotypes: Copper sulphate [R] (9)




stringent phenotypes (note that 'suggestive' pheno- 
types are not reported). In some cases we found 
that highly pleiotropic deletants displayed a slow 
growth phenotype, which could be observed 
already on glucose medium, YPD (e.g. YNL206c). 
However, this is not always the case. On the one 
hand, some extremely slow-growing deletants (like 
YJL059w) did not display a highly pleiotropic 
phenotype, being sensitive to only two com- 
pounds. On the other hand, some extremely 
pleiotropic deletants (like YOL018c) were not 
particularly slow growing on standard media. 
Finally, some hypersensitivity/resistance chemotypes
could possibly result from changes in the cell
membrane/wall permeability. However, in view of 
the large diversity of compounds used and chemo- 
types observed (see Tables 1 and 2) this explana- 
tion could not be true for the majority of 
chemotypes. It should be added that, in Table 2, 
only 'clear' phenotypes are listed. In addition to
those mentioned, several 'suggestive' phenotypes 
(see Rieger et al., 1997 for discussion of this point)
have been observed. Since they were much weaker, 
reflect subtle variations in growth rate only, and 
frequently were different between the MATa and
MATα series of deletants, their significance is
doubtful and they have therefore not been 
presented in this study. 
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It remains to be established in a fully systematic 
manner that the chemotypes we observe 
co-segregate with the deletion; however, we 
have verified this in some cases. For example, 
co-segregation of the hypersensitivity to thiabendazole
and IPCPC with the YJL047c and
YNL206c deletions has been observed in tetrad 
analysis and was confirmed by complementation 
with the corresponding cognate wild-type genes. 
Furthermore, we have retested at random 12 compounds
and nine mutants of opposite mating type
(MATa) from the EUROSCARF collection and in 
more than 80% of the cases the deletants were 
found to respond in a manner similar to that 
shown by the MATa series of deletants. 

The method used gives a large number of 
chemotypes, and we have found that over 50% of 
the deletants (which are listed in the MIPS/ 
EUROSCARF database as 'normal' with no 
detectable phenotype) do have a significant growth 
deficiency in one or more of the conditions tested. 

In conclusion, we suggest that the chemotyping 
approach described in this work is able to con- 
tribute to large-scale analysis of unknown gene 
function in several ways: (a) genes can be grouped 
into functional groups based on their observed 
sensitivity or resistance to different compounds; (b) 
important clues to gene function can be inferred 
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from a knowledge of the known pharmacological 
action of the agent which modifies growth of the 
corresponding mutant; and (c) generation of a
scoreable phenotype from an otherwise silent 
mutation empowers a genetic approach to study- 
ing gene function. We conclude, therefore, that 
systematic chemotyping of yeast mutants will lead 
to a more precise understanding of the biochemical 
and physiological function for many of the 
'functional orphan' yeast genes. 
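
Point (a) above, grouping genes by their observed sensitivity or resistance, can be illustrated with a minimal sketch. The four profiles are read from the phenotype table earlier in this section; the use of Jaccard overlap and the 0.3 threshold are assumptions made for this example and are not the authors' stated method.

```python
from itertools import combinations

# Compound -> response for four deletants, as read from the phenotype table.
chemotypes = {
    "YJL047c": {"thiabendazole": "HS", "azaserine": "HS", "IPCPC": "HS"},
    "YJL045w": {"azaserine": "HS"},
    "YBR162c": {"EDTA": "S", "copper sulphate": "HS"},
    "YJL057c": {"copper sulphate": "HS"},
}


def jaccard(a, b):
    """Shared (compound, response) pairs divided by all pairs in either profile."""
    pa, pb = set(a.items()), set(b.items())
    return len(pa & pb) / len(pa | pb)


def similar_pairs(profiles, threshold=0.3):
    """Return (orf_x, orf_y, score) for profiles overlapping at least `threshold`."""
    pairs = []
    for x, y in combinations(sorted(profiles), 2):
        score = jaccard(profiles[x], profiles[y])
        if score >= threshold:
            pairs.append((x, y, round(score, 2)))
    return pairs


if __name__ == "__main__":
    for x, y, score in similar_pairs(chemotypes):
        print(x, y, score)
```

With only four deletants the grouping is trivial; the same comparison applied to the full table would give candidate functional groups that still need genetic confirmation.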
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PROTOCOL 


Robotic Replication 



1. Label microtiter plates and fill with liquid medium as described in steps 1 
and 2 of the Manual Replication protocol (see p. 18). 

Note: The labeling of plates and addition of medium can be performed in a separate area, 
either in a laminar flow hood, if one is available, or on an open bench area using standard 
aseptic techniques. 

2. Set up the robot as follows: 

a. Fill the sterilization bath with 95% ethanol to a depth greater than the 
depth of culture in the microtiter plates. 

b. Fill the sonication bath with sterile distilled H2O containing 0.1% (v/v) 
decon. 

c. Switch on the air filter. 

d. Attach the appropriate replicating tool (either a 96-pin or 384-pin tool) 
to the moving head. 

e. Clean the pins of the replicating tool by placing it in the sonication bath. 
Sonicate for 3 minutes. 

f. Sterilize the pins by immersion in the ethanol bath for approximately 15 
seconds. Air-dry for approximately 10 seconds. 

3. Place the source and recipient plates on the bed of the robot. Remove the 
plate lids just before beginning the replication procedure. 

4. Set up a replicating routine as follows: 

a. Immerse the pins of the replicating tool in the sterilization bath for 
approximately 15 seconds. 

b. Air-dry for approximately 10 seconds. 

c. Immerse the pins in the wells of the first source plate. To avoid any 
splashing of culture, slowly remove the tool from the plate. 

d. Immerse the pins in the wells of the first recipient plate and remove the 
tool. 

Note: The thickness of the replicating pins determines the volume of culture to be transferred 
and therefore influences the rate of growth in the recipient plate. Multiple inoculations may 
be required to achieve sufficient growth at 37°C in 16-20 hours, and this should be tested 
empirically with a small number of plates. 

5. Repeat step 4(a-d) with subsequent plates. 

6. Replace the lids as soon as inoculation of the plate is completed. Freeze the 
source plates as described in step 6 (see p. 17) and incubate recipient plates 
at 37°C overnight as described in step 5 (see p. 17). 
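
The replicating routine in steps 4 and 5 can be written out as an ordered step list. This is a minimal sketch only: the step names and the `replication_schedule` function are hypothetical and do not correspond to any particular robot's control software; the 15-second and 10-second timings are the approximate values given in the protocol.

```python
def replication_schedule(plate_pairs, sterilize_s=15, dry_s=10):
    """Build the ordered steps 4(a-d) for each (source, recipient) plate pair."""
    steps = []
    for source, recipient in plate_pairs:
        steps.append(("sterilize_pins", sterilize_s))   # 4a: ethanol bath
        steps.append(("air_dry", dry_s))                # 4b
        steps.append(("dip_source", source))            # 4c: withdraw slowly
        steps.append(("dip_recipient", recipient))      # 4d
    return steps


if __name__ == "__main__":
    for step in replication_schedule([("S1", "R1"), ("S2", "R2")]):
        print(step)
```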
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PROTOCOL 

Recovery of Individual Clones 



The following procedure should be performed in a laminar flow hood if one is 
available. Otherwise, perform the procedure on an open bench area using stan- 
dard aseptic techniques. 

1. Prepare a list of the plates and addresses of the clones to be recovered. 

Note: Organizing this list in accordance with the system used to store the plates in the freez- 
er will speed the process of plate retrieval and reduce the amount of time freezer doors must 
be kept open. 

2. Withdraw only the desired plates from storage at -80°C and place them in 
a tray on a layer of dry ice. 

3. Remove the plate lid. Blot any liquid that may have condensed on the sur- 
face of the plate with sterile, absorbent paper. 

Note: It is important to blot any liquid from the surface of the plate to reduce well-to-well 
contamination in the library, particularly for library copies in 384-well plates (where there is 
no physical barrier to liquid spread by capillarity). Use sterile Whatman 3 MM for this pur- 
pose (sterilized by autoclaving). 

4. Pick a small amount of the frozen culture with a sterile wooden cocktail 
stick (toothpick) or wooden applicator stick. Streak the culture onto an LB- 
agar plate containing the appropriate antibiotic. 

Notes: For appropriate concentrations of antibiotics in culture medium and the preparation 
of stock and working solutions, see Appendix. 

When picking, use a drilling motion with the stick, rather than a levering motion which 
may result in frozen culture being transferred to neighboring wells. 

Some laboratories use a sterile mask with a single-well-size hole (e.g., made from 
Whatman Benchkote paper) to avoid cross-contamination between cultures in separate 
wells. 

5. Replace the plate lid and return the plate to its stack in the freezer. Incubate 
the agar plates at 37°C overnight to allow growth of colonies. 
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Library Screening 

Several effective strategies for screening genomic 
libraries for clones of interest have been developed. 
These are based on colony hybridization, PCR, or a 
combination of both techniques. Colony hybridiza- 
tion with bacterial clones is extremely robust and 
efficient, typically yielding clear, strong signals even 
when the cloned DNA is present as a single copy 
per cell (e.g., BACs and fosmids; Kim et al. 1994). 
An advantage of screening by hybridization is the 
variety of different types of probes that can be used, 
including those corresponding to single-copy 
sequences, repetitive elements, insert end segments 
(e.g., for "walking" efforts), and whole clone inserts 
(e.g., purified YAC DNA, BAC DNA, or PCR prod- 
ucts). The routine of arraying and storing genomic 
libraries in microtiter plates facilitates the produc- 
tion of filters containing colonies at varying densi- 
ties. When robotics are used, colonies can be 
arrayed at a very high density in a highly repro- 
ducible fashion. Approximately 20,000 clones can 
be represented on a 22 x 22-cm filter, and an 8 x 
12-cm filter can hold several thousand clones. 
Alternatively, pools of DNA derived from well- 
defined combinations of microtiter plates or por- 
tions thereof can be generated from arrayed 
libraries and used for PCR-based screening. Such 
pools can be used effectively for rapid and efficient 
screening of the library using specific PCR assays. 
One drawback to PCR-based screening, however, is 



that the amount of effort required for generating 
suitable DNA pools goes up dramatically as the 
number of clones in a library increases. This is par- 
ticularly the case if the screening strategy is 
designed to allow the identification of individual 
positive clones (as opposed to a delimited set of 
clones, such as an individual microtiter plate). As a 
result, a "hybrid" screening strategy using a combi- 
nation of PCR and colony hybridization can be used 
as a compromise for genomic library screening. 
Specifically, PCR screening is used to identify indi- 
vidual microtiter plates containing positive clones. 
Then, colony hybridization is performed with those 
microtiter plates to identify the precise well posi- 
tions of positive clones (Green and Olson 1990a). 
However, although in the majority of cases the 
clones identified by hybridization of a PCR product 
will be the same as those detected by PCR, the 
specificity of an STS as a hybridization probe is not 
always identical to the specificity of its correspond- 
ing PCR assay (e.g., due to the presence of repeti- 
tive sequences residing between the two PCR 
primers). 
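
The two-stage "hybrid" strategy described above can be sketched as follows. This is an illustration under stated assumptions: the library layout, the function names, and the idea that a plate pool is PCR-positive whenever any clone in it is positive are simplifications introduced here, and the sketch ignores the probe-specificity caveat raised at the end of the paragraph.

```python
def hybrid_screen(library, pcr_positive, hybridizes):
    """Stage 1: one PCR per plate pool. Stage 2: colony hybridization
    on positive plates only, returning (plate, well) addresses."""
    positive_plates = [
        plate for plate, wells in library.items()
        if any(pcr_positive(clone) for clone in wells.values())
    ]
    hits = [
        (plate, well)
        for plate in positive_plates
        for well, clone in sorted(library[plate].items())
        if hybridizes(clone)
    ]
    return positive_plates, hits


def make_library(n_plates=3, wells_per_plate=96):
    """Toy arrayed library: plate id -> {well index: clone name}."""
    return {
        f"P{p + 1}": {w: f"P{p + 1}-clone{w}" for w in range(wells_per_plate)}
        for p in range(n_plates)
    }


if __name__ == "__main__":
    library = make_library()
    targets = {"P2-clone7"}
    plates, hits = hybrid_screen(
        library, lambda c: c in targets, lambda c: c in targets
    )
    print(plates, hits)
```

In this toy case three pool PCRs followed by one plate hybridization locate the clone, rather than one assay for each of the 288 wells.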

A great majority of investigators perform 
hybridization- or PCR-based screening of genomic 
libraries using hybridization filters or DNA pools 
obtained from central facilities or companies (see 
discussion on pp. 2-3). However, it may on occa- 
sion be necessary to produce filters for hybridiza- 
tion screening or DNA pools for PCR screening in a 
local laboratory. Thus, appropriate protocols are 
provided for each of these in the following sections. 
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Screensavers: trends in 
high-throughput analysis 



Even, and perhaps especially, in the 
post-genomic era, the search for 
new drugs begins with the detection 
and (hopefully) validation of a novel 
target, and with the development of 
an assay for the interaction of an arti- 
ficial agonist or antagonist - and of 
its natural ligand, if known - with 
that target. 

As the race to develop new chemi- 
cal entities with novel pharmacologi- 
cal activity hots up, the major phar- 
maceutical companies have invested 
heavily in automation and robotics, 
and in the high-throughput assaying 
of large compound libraries for phar- 
macological activities against drug 
receptors. The current state of the art 
in this area was the theme of a recent 
international conference*, which 
reflected the intense applied interest 
in this area by the fact that some 98% 
of the 1256 delegates were from 
industrial corporations. 

Cruising the post-genomic 
orphanage 

Both the results of the whole- 
genome-sequencing programmes 
and the contents of the extensive 
expressed sequence tag (EST) data- 
bases have made it abundantly clear 
that many gene products have no 
close relatives in the databases (they 
are 'orphans') and are of unknown 
function, that many are cell- 
membrane receptors (especially 
G-protein-coupled receptors), that 
their natural ligands are also 
unknown, and that they consequently 
represent important novel targets for 
both agonists and antagonists. 

The newly discovered orexins A 
and B and their receptors are a case 
in point (Masashi Yanagisawa, Uni- 
versity of Texas Southwestern Medi- 
cal Center, Dallas, TX, USA), and 
are involved in the control of food 
intake. Direct binding assays are 
possible but measurements based on 
the agonist-dependent production of 
Ca2+ transients in transfected cells 
provide a more pertinent, functional 

*The 4th Annual Meeting of the Society for 
Biomolecular Screening (http://www.sbsonline. 
org/) was held in Baltimore, MD, USA, 
20-24 September, 1998. 



assay. It is recognized that there 
are some 10 000-40 000 different 
mRNAs in the 500 000 molecules 
expressed in a typical mammalian 
cell. This means that, of the many 
novel approaches to functional 
genomics, expression profiling of 
these mRNAs using oligonucleotide 
arrays has enormous power for pro- 
viding clues to the function of orphan 
genes (Eugene Brown, Genetics 
Institute, Cambridge, MA, USA). 
Modern systems of this type are lin- 
ear from 0.5 to 500 pM target RNA 
and are reproducible and quantita- 
tive. Such arrays, preferably prepared 
using bacterial artificial chromosomes, 
allow the facile detection of huge 
numbers of single-nucleotide poly- 
morphisms (Janice Kurth, Genset 
Corporation, La Jolla, CA, USA). 
These can be exploited in screening 
patients for their likelihood of acquir- 
ing particular illnesses and for their 
suitability for chronic drug therapy. 

What you see is what you get 
— novel optical methods for 
high-throughput screening 

Fluorescent methods are probably 
the methods of choice for high- 
throughput screening (HTS) assays, 
and their repertoire continues to 
increase. Those based on variants of 
time-resolved fluorescence (which 
allows the discrimination of the sig- 
nal of interest from other fluorescent 
background signals) have particular 
merit, especially as the drive towards 
miniaturization means that, because 
of contributions from the assay plates, 
the signal decreases more quickly 
than the background as the assay vol- 
ume is reduced (Jack Owicki, LJL 
Biosystems, Sunnyvale, CA, USA). 

The current move is away from 
the traditional 96-well plate to 384- 
and especially 1536-well versions, 
where reagent costs are typically 100 
times lower and assay volumes drop 
from some 400 µl to 5-10 µl 
(Jonathan Burbaum, Pharmacopeia, 
Princeton, NJ, USA). Technical issues 
become significant here, such as the 
use of conical rather than square wells 
to avoid wicking and the importance 
of measures to stop evaporation, but 
the great benefit is cost reduction, 



with typical costs for a screening 
campaign being reduced from 
US$35 million to US$1.1 million. 
Best of all is if there are no reagents. 
The use of infrared spectroscopy in 
HTS is a novel, reagentless and 
generic technique requiring at most 
a few µl of sample; as an example 
from titre-improvement programmes, 
the measurement time may be 
reduced to 1 sec from the 15 min 
required for the traditional HPLC 
analysis (Douglas Kell, University of 
Wales, Aberystwyth, UK). 

Classical analysis of optical assays 
in microtitre plates used scanning 
methods in which the results were 
read sequentially by a single detector. 
This represented a substantial bottleneck
in the speed of the overall 
screening process and thus a major 
trend is towards imaging methods in 
which, by coupling a telecentric 
(non-parallax) lens and a CCD cam- 
era, an entire plate may be imaged 
and read simultaneously (Ronald 
Barrett, Affymax Research Institute, 
Palo Alto, CA, USA; Neil Cook, 
Amersham Pharmacia Biotech, 
Cardiff, UK). To achieve these levels 
of sensitivity (at which the photon 
flux may be a hundredth to a ten- 
thousandth of that of starlight), im- 
provements are required in all areas, 
with reagents, hardware and software 
all contributing to the achievement 
of the required sensitivity. 
Fibre-optic arrays provide another 
means of interrogating many assays in 
parallel; etching microwells onto the 
end of such optical fibres allows 
assays to be performed in volumes as 
low as 90 fl (David Walt, Tufts University,
North Grafton, MA, USA). 
More accessibly, confocal methods 
exploiting fluorescence-correlation 
spectroscopy (FCS) can interrogate a 
volume of 1 µm^3 (i.e. 1 fl), in which 
a 10 nM solution of a fluorophore 
contains on average six molecules. 
Analysis of the time course of fluctuations
in their number density provides 
much information on their molecu- 
lar environment and, in particular, 
on whether they are bound or free; any 
'traditional' fluorescence assay may be 
configured for FCS (Keith Moore, 
SmithKline Beecham, Harlow, UK). 
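
The figure of about six molecules in the FCS observation volume follows directly from the concentration and the volume. A short check, assuming only the Avogadro constant and the conversion 1 µm^3 = 1 fl = 10^-15 litre (the function names are illustrative):

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def cubic_micrometres_to_litres(volume_um3):
    """1 µm^3 equals 1 femtolitre, i.e. 1e-15 litre."""
    return volume_um3 * 1e-15


def molecules_in_volume(conc_molar, volume_litres):
    """Mean number of solute molecules in a given volume."""
    return conc_molar * volume_litres * AVOGADRO


if __name__ == "__main__":
    n = molecules_in_volume(10e-9, cubic_micrometres_to_litres(1.0))
    print(f"{n:.2f} molecules on average")
```

Because the mean occupancy is only a handful of molecules, the relative fluctuations in number density are large, which is what makes them measurable.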

The numbers game; tracking 
chemical diversity 

Imagine that there are just ten cru- 
cial and independent parameters 
('explanatory variables') that can 
contribute to a drug's activity and 
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that obtaining a lead compound with 
a binding constant of 1 µM or better 
requires that each of them is within 
±10% on a linear scale of the 'cor- 
rect' or optimal value. This means 
that we are looking for a single entity 
in 5^10 possible sets of properties (and 
for a precision of just twice this per 
explanatory variable, there are 10^10 
possibilities). Considerations such as 
this have led to the explosion of 
libraries of candidate drug com- 
pounds, typically containing 10^5-10^6 
pure substances. To produce these is 
therefore as significant as the need to 
analyse their pharmacological behav- 
iour, and to assess and maximize the 
chemical diversity within a library is 
of particular importance. 
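
The counts quoted above can be reproduced with a short calculation. The sketch assumes that a tolerance of ±10% on a linear scale divides each variable's range into 1/(2 × 0.10) = 5 equally likely windows, which appears to be the reasoning behind the quoted figure; `property_sets` is a name introduced here for illustration.

```python
def property_sets(n_variables, tolerance):
    """Number of distinct property combinations when each variable
    must fall within +/- `tolerance` of its optimum on a unit scale."""
    windows_per_variable = round(1 / (2 * tolerance))
    return windows_per_variable ** n_variables


if __name__ == "__main__":
    print(property_sets(10, 0.10))  # ten variables at +/-10%
    print(property_sets(10, 0.05))  # the same at twice the precision
```

The two calls give 5^10 = 9 765 625 and 10^10 = 10 000 000 000 respectively.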

However, using the methods of 
combinatorial chemistry combined 
with evolutionary algorithms means 
that a potential library of 160 000 
(from a four-step chemical synthesis 
in which there are 10, 10, 40 and 40 
reagents of each type) can be decreased 
to just 400 experiments in which 
after a 'random' 20 mixtures, each 
reagent type is optimized over 19 
further generations (Klaus Gubernator, 
Combichem, San Diego, CA, USA). 
Indeed, the combination of chemi- 
cal diversity with intelligent com- 
puter analysis is crucial to this type of 
enterprise. Both active and inactive 
compounds contribute useful infor- 
mation as one seeks to decrease the 
search space for useful pharmaco- 
phores, especially if structural (rather 
than biophysical) parameters are used 
as the inputs (Susan Bassett, Bioreason, 
Santa Fe, NM, USA). In terms of 
finding metrics for assessing diversity 
(a generic problem that may also be 
applied to biodiversity metrics), good 
metrics for library design must always 
cluster molecules with similar bio- 
logical activities together (Dora 
Schnur, Pharmacopeia, Princeton, 
NJ, USA), and yet diversity metrics 
should seek to minimize the prox- 
imity of molecules in descriptor space 
(Adrienne Tymiak, Bristol-Myers 
Squibb Pharmaceutical Research 
Institute, Buffalo, NY, USA). A simi- 
larity index based on the average 
number of shared atom-pairing 
descriptors over the total number of 
such descriptors provides a robust 
metric for such analyses. 
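
The search-space reduction quoted at the start of this paragraph can be checked with a few lines. The reagent counts and generation numbers are those given in the text; reading "400 experiments" as 20 mixtures in each of 20 generations (one initial plus 19 further) is an interpretation made here.

```python
from math import prod

reagents_per_step = [10, 10, 40, 40]   # four-step synthesis
mixtures_per_generation = 20
generations = 1 + 19                   # initial 'random' round plus 19 further


def full_library_size(counts):
    """Every combination of one reagent per synthetic step."""
    return prod(counts)


def experiments_run(per_generation, n_generations):
    return per_generation * n_generations


if __name__ == "__main__":
    total = full_library_size(reagents_per_step)
    run = experiments_run(mixtures_per_generation, generations)
    print(total, run, f"{100 * run / total:.2f}% of the library")
```

On these numbers the evolutionary search samples 0.25% of the potential library.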

Both chemical libraries and the 
results of assays using them produce 
huge amounts of data, which many 
workers in a large organization may 
wish to access. The design and con- 



struction of appropriate databases is 
thus another great need, and the 
interface must be constructed in a 
way that both makes navigation easy 
for the bench scientist and permits 
the facile deployment of powerful 
query and report tools. Only hierar- 
chical methods permit this with any 
convenience, and allow the user to 
organize data from thousands of drug 
screens. The Discovery explorer tool is 
one such implementation, which pro- 
vides decision support via a scalable, 
robust, flexible and enterprise-wide 
architecture (Anthony Kreamer, 
SmithKline Beecham, King of Prussia, 
PA, USA). 

Given that the human genome 
probably contains some 70 000 genes, 
that one might wish to assay some 
10% of these and that half may be 
amenable to direct binding assays, the 
big pharmaceutical companies will 
certainly need to be looking at the 
results of several thousand screens 
(Mario Geysen, GlaxoWellcome, 
Research Triangle Park, NC, USA). 
If combinatorial chemistry is to be 
the main source of leads (as well as 
natural products), only the split-and- 
mix strategy of synthesis on solid sup- 
ports (beads) is appropriate; discrimi- 
nation between beads may be carried 
out by encoding them via a linker 
labelled with stable isotopes in various 
ratios. 

Small is beautiful: 
miniaturization in ultra-HTS 

A common, if arbitrary, definition 
of a system for ultra-HTS is one in 
which 100 000 assays are run per day. 
This is slightly more than one per 
second and requires careful integra- 
tion of the necessarily robotic sys- 
tems, which deploy compound 
libraries, run the assays and analyse 
the data. With primary hit rates typi- 
cally running at 0.1%, it is critical to 
minimize both false positives and 
false negatives, and to ensure that the 
miniaturized assays in 1536-well 
plates with volumes under 10 µl 
behave exactly like those carried out 
in the test tube or the 96-well plate. 
Even the 1536-well plate has its 
competitors, as laboratory-on-a-chip 
systems (in which reagents, cells and 
drug candidates are mixed by elec- 
trokinetic forces operating in micro- 
fluidic channels of 10-100 µm) allow 
complex assays such as those for Ca2+ 
release to be operated at rates of 
2000 cells min^-1 and allow several 
thousand replicates to be analysed 



in a total volume of less than 
20 µl (Michael Knapp, Caliper 
Technologies, Palo Alto, CA, USA). 
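
The throughput implied by the ultra-HTS definition above is easy to verify. The 100 000 assays per day and the 1536-well format come from the text; the plates-per-day figure is an added calculation that assumes one assay per well and no control wells.

```python
from math import ceil

SECONDS_PER_DAY = 24 * 60 * 60


def assays_per_second(assays_per_day):
    return assays_per_day / SECONDS_PER_DAY


def plates_per_day(assays_per_day, wells_per_plate):
    """Whole plates needed if every well holds one assay."""
    return ceil(assays_per_day / wells_per_plate)


if __name__ == "__main__":
    print(f"{assays_per_second(100_000):.2f} assays per second")
    print(plates_per_day(100_000, 1536), "plates of 1536 wells")
    print(plates_per_day(100_000, 96), "plates of 96 wells")
```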

However, analyses done under 
these ultra-HTS conditions must be 
considerably more robust than those 
initially developed by the scientists 
studying novel targets. The opti- 
mization of such assays represents a 
combinatorial problem as intractable 
as that described above regarding the 
statistical difficulty of optimizing a 
drug lead [even the question of 
whether to include one of 16 buffer 
components, never mind optimizing 
its concentration, gives 2^16 (or 
65 536) possibilities requiring 683 
96-well plates if all are to be assessed]. 
Modern methods of experimental 
design can reduce this to just two or 
three such plates, and their preparation 
may be integrated into a laboratory 
robotics system (Frances Stewart, 
SmithKline Beecham, King of Prussia, 
PA, USA). 
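
The buffer-component arithmetic in the paragraph above works out as follows. The sketch assumes one buffer recipe per well and no replicate or control wells, which matches the quoted plate count; the function names are illustrative.

```python
from math import ceil


def include_exclude_combinations(n_components):
    """Each component is either present or absent."""
    return 2 ** n_components


def plates_needed(n_conditions, wells_per_plate=96):
    return ceil(n_conditions / wells_per_plate)


if __name__ == "__main__":
    conditions = include_exclude_combinations(16)
    print(conditions, "conditions on", plates_needed(conditions), "plates")
```

Set against the two or three plates quoted for a designed experiment, the exhaustive screen needs more than 200 times as many plates.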

As assays become smaller, we enter 
the field of nanotechnology; 
nm-sized gold microspheres possess 
unusual optical properties (such as a 
molar extinction coefficient of some 
10^9 M^-1 cm^-1) that change dramati- 
cally upon ligand binding. They have 
great potential for exploitation in dif- 
ferent types of binding measurements 
based on surface-plasmon resonance 
(Michael Natan, Pennsylvania State 
University, University Park, PA, 
USA) and in the highly selective 
detection of DNA. In this technique, 
the chromophoric changes attending 
nucleic-acid binding to gold micro- 
spheres derivatized with complemen- 
tary nucleic acids are both unusually 
temperature dependent and perma- 
nent (Chad Mirkin, Northwestern 
University, Evanston, IL, USA). 

New ways to analyse cellular 
properties may also be greatly assisted 
by modifying the biology. The yeast 
two-hybrid system is a well-known 
method of detecting protein-protein 
interactions in vivo but traditional 
versions can be rather tedious, as cells 
need to be cotransfected with both 
putative partners of the binding event. 
A new variant has, however, been 
developed in which an entire cDNA 
library containing the 'prey' is held 
in one yeast mating type and cells of 
the other mating type are trans- 
formed with the 'bait' (Yiwu He, 
GlaxoWellcome, Research Triangle 
Park, NC, USA). Coincubation of 
the cells followed by a chemilumi- 
nescent β-galactosidase reporter assay 
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means that one person can supervise 
the automated performance of 50 full 
assays per month, each typically pro- 
ducing 20 binding partners that 
might provide novel drug targets. 

Interrogating single molecules is 
clearly the ultimate in miniaturization 
and, although such methods cannot 
yet easily be parallelized, scanning- 
probe methods such as atomic-force 
microscopy with functionalized probes 
allow the direct and elegant measure- 
ment and discrimination of the inter- 
action between single molecules (Saul 
Tendler, Nottingham University, 
Nottingham, UK). 

And was it all worth it? 

Is the modern marriage of bio- 
chemical-, genomic- (and intuition-) 
based target development and HTS 
leading to new and useful drugs? 
Two examples suffice to indicate that 
the answer is resoundingly in the 
affirmative. 

Viramune is a novel, nonnucleo- 
sidic drug active against the reverse- 
transcriptase enzyme of the human- 
immunodeficiency-virus 1 (the major 
causative agent of AIDS), and was 
approved for use in 1996 (John 
Proudfoot, Boehringer Ingelheim 
Pharmaceuticals, Ridgefield, CT, 
USA). In 1988, it was known only 
that the target enzyme contained two 
subunits (an X-ray structure became 
available in 1992) and a drug screen 
against it was initiated (at this time, 



HTS meant 100-200 compounds per 
week!). After testing approximately 
1600 molecules, of which 1% gave 
some kind of 'hit', a lead compound 
was discovered with an IC50 of 6 µM. 
Optimization of this compound led 
to a novel chemical entity with 
excellent pharmacological properties 
and an IC50 of 35 nM, low enough 
that therapeutic doses (giving a blood 
concentration of some 25 µM) were 
active even against 'resistant' strains. 

It is now well established that the 
many complications of diabetes, such 
as nephropathy, neuropathy and, in 
particular, retinopathy leading to 
blindness, are exacerbated when glu- 
cose levels are not well controlled. 
What has been much less clear is the 
mechanism by which chronic hyper- 
glycemia actually causes these compli- 
cations. Following detailed work at 
Sphinx Pharmaceuticals, which de- 
veloped assays specific for the many 
kinds of protein kinase, it was hy- 
pothesized that it was, in fact, an 
excessive activity of the β isoforms of 
protein-kinase C (PKC) induced by 
the higher levels of diacylglycerol 
formed under conditions of glucose 
excess that might be the major cul- 
prit (William Heath, Lilly Research 
Laboratories, Indianapolis, IN, USA). 

Despite management opposition 
(because non-selective inhibitors of 
the PKC family were known to be 
highly cytotoxic), a series of screens 
was developed against each of the 



eight human PKC isozymes. An 
indolecarbazole natural product 
related to staurosporine led to the 
development of LY 333531, a highly 
selective molecule with an IC50 of 
5 nM against the β isoforms but sev- 
eral hundred nM against the rest. 
Interestingly, even though the screen 
was intended to find molecules act- 
ing on the regulatory site of the PKC 
enzyme, it was actually the catalytic 
(ATP-binding) site that is the target, 
and the competitive inhibition was 
only observed when the assay was 
run at ATP levels significantly lower 
than those thought to be prevalent in 
vivo: 'The ATP level in the cell is 
1 mM, but not everywhere in the cell.' 

Assay I say 

In conclusion, it is evident that the 
trend towards miniaturization, the 
intelligent generation and deploy- 
ment of chemical libraries, the inno- 
vative hardware and software, and the 
robust automation now available are 
major forces in the drive to develop 
new pharmaceuticals with novel tar- 
gets, high efficacy and, of course, sub- 
stantial commercial potential. How- 
ever, for these hopes to be realized, 
good assays will remain paramount. 

Douglas Kell 

Institute of Biological Sciences, Aberystwyth, 

UK SY23 3DD, 
(E-mail: dbk@aber.ac.uk; 
WWW: http://gepasi.dbs.aber.ac.uk/home.htm) 



The sheep and the goats 



Communication between the mak- 
ers and the users of laboratory 
equipment is always a problem. The 
user wants to know what is available, 
and to tell the maker the problems 
with their equipment; the maker 
wants to make sure that the user 
knows about the latest products, and 
to make sure their own products are 
as good as they can be. A recent meet- 
ing* gave an excellent opportunity for 
this sort of exchange of information. 

The speakers were a varied mix- 
ture - some from the manufacturers 

*Bio-Europe '98: Bioseparation and Bioprocessing 
of Biological Molecules, the eighth annual meet- 
ing organized by G. Subramanian, was held in 
Cambridge, UK, 7-9 September 1998. 



of different types of separation equip- 
ment, extolling the virtues of their 
latest devices, others from the sharp 
end, reporting their own experience 
of various techniques, yet others 
discussing issues arising from the 
procedures involved. 

Products 

There are many different sorts of 
separation equipment out there, in- 
cluding classical chromatography 
columns, packed and fluidized beds, 
and membrane-based systems. 
Harvey Brandwin (Pall Filtron, 
Northborough, MA, USA) discussed 
some of the problems of membrane- 
based systems, including the need to 
optimize each system individually, 



but emphasized their advantages, 
especially in virus removal. He also 
reported that Pall Filtron have devel- 
oped new filters capable of 3-log 
removal of viruses down to 20 nm. 
When using filters with such narrow 
pores, it is essential to use prefilters, 
to prevent the filter rapidly clogging 
up, and they will also help when 
using larger-pore-size filters. 

W-D. Schleuning (Schering, Berlin, 
Germany) introduced new surrogate 
systems for evaluating the biological 
activity of new agents to replace the 
standard animal models. These included
the use of hormone-sensitive
promoters in yeast and the two- 
hybrid system, and also touched on 
issues of high-throughput-screening
systems, which are already moving
from 96-well to 384-well plates, and
may in the future move to 864- and 
even 1536-well plates. 
Drug screening – beyond the
bottleneck 

High-throughput screening still has a long way to go if it is to achieve its often-touted 
potential. 


Alan Dove 



Screening for a drug is often likened to 
searching for a needle in a haystack. In fact, 
with the application of combinatorial 
approaches and parallel synthesis to discov- 
ery, the sheer scale and diversity of chemical 
libraries have made identifying drug leads 
more akin to searching for a needle in many 
different haystacks. Today, screening pro- 
grams may process up to 100,000 com- 
pounds a day, a thousand times as many as 
were processed in an entire week in 1990. To 
complicate the problem still further, a surfeit 
of new drug targets and their variants is 
emerging from genomics efforts. 

Faced with the daunting prospect of 
screening thousands, or even hundreds of 
thousands, of chemical structures in one day, 
against an expanding universe of targets, big 
pharma has turned to biotechnology compa- 
nies that specialize in equipment and services 
for high-throughput screening (HTS)— a 
term applied to several types of technologies 
for rapid testing or development of pharma- 
ceutical lead compounds. The majority of 
these companies offer a range of products 
and strategies that are as varied as they are 
numerous. Whether they can deliver results 
that match their ambitious claims remains 
uncertain. Already, pharmaceutical compa- 
nies are rapidly learning that clearing one 
obstacle in drug development often brings 
another into view. In fact, accomplishment 
of drug development goals may require a 
change not only in technology infrastruc- 
ture, but also in the culture of pharmaceuti- 
cal research itself. 

Increasing throughput 

The systematic screening of natural com- 
pounds (and more recently synthetic ones) 
to find potential drug leads has been a cor- 
nerstone of pharmaceutical research for 
many decades. Typically, a screening pro- 
gram commences with the development of 
an assay (originally performed in vivo, but
now almost exclusively done using cells or
targets in vitro) intended to mimic some
aspect of a disease of interest. Assay development
is then carried out to determine optimal
conditions and ensure that the system
under study is both robust and reproducible. 
With a hit rate of 0.1-0.5%, this is particularly
important, in that false positives and negatives
can lead to squandered time or resources in the
pursuit of unsuitable leads.

The move to increase throughput (i.e., the number
of individual drug activity assays performed in a
given time) has focused on two areas. First, test
volumes are shrinking from milliliters to
microliters (and with miniaturization even
nanoliters; see "Cutting screening down to size").
Thus, the 96-well microplate replaced 1 ml assays
in the 1980s, the 384-well plate was introduced in
1994, and the 1,536-well plate (with a capacity of
1-2 µl) is now being touted as the new industry
standard.

Second, processes are being automated to make them
less labor intensive, more reproducible, and less
expensive. Robots now perform repetitive tasks
like pipetting, assay reading, or sample storage,
allowing compounds to be retrieved from an
automated dispensary, applied to cultured cells
for assay under tightly controlled conditions, and
scored for activity without human intervention.
Some companies have begun to refer to these new
technologies, which can push screening rates as
high as 100,000 assays per day (slightly more than
one per second), as ultrahigh-throughput screening
(UHTS) 1.

Figure 1. Caliper's microfluidic chips are now
being refined for use in HTS.

In solving the problem of assay replication,
however, the automated systems have uncovered the
difficulty of integrating diverse pieces of
machinery and incorporating the mass production
operation into traditional R&D culture. "As a
rule, most of the major companies are buying what
are generally referred to as platform
technologies. They've gone out and bought what
they feel are the best technologies in each area,"
explains Richard Archer, CEO of The Automation
Partnership (Cambridge, UK). The problem is that
the resulting collection of machinery may have to
be reengineered to work as an integrated system.
Archer says
his company is working to solve this prob- 
lem by marketing "big boxes" — complete 
general-purpose HTS systems that can be 
purchased outright, providing everything 
from sample storage and automated cell 
culture to assay readout. 

Taking the integrated system concept a 
step further, Aurora Biosciences (San 
Diego, CA) sells its HTS systems in con- 
junction with proprietary cellular assays 
(see below). The integration of HTS and 
assay comes at a price, however. Rather 
than selling its systems off the shelf, Aurora 
licenses its technology with reach-through 
agreements that include milestone pay- 
ments and royalties on any drugs developed 
with the system. This approach is somewhat
unusual, as most HTS providers sell equipment
outright or provide screening services on a
contract basis.

Although the assay systems are getting smaller,
the machines handling them still dwarf the
benchtop equipment to which most researchers are
accustomed. Archer's company markets systems that
are "about 150 feet long, about 30 feet high, and
weigh 200 tons." Systems on this type of scale are
very alien to most researchers working in R&D
programs, who are used to working in more of an
academic atmosphere rather than a factory-like
environment, in which high numbers of repetitive
operations are performed around the clock. Some
observers downplay this problem, but Archer
contends that it will be a key issue in the
development of HTS, suggesting that companies
should establish physically separate, factory-like
HTS facilities with dedicated staffs 2. At least
one company, Bristol-Myers Squibb (Princeton, NJ),
seems to have taken that advice to heart, building
a new wing to accommodate an HTS system.

Figure 2. Big is beautiful. The factory-like HTS
system marketed by the Automation Partnership.



Assay, assay, assay 

Whether HTS is carried out in a laboratory or
a "factory," building the capacity to perform
assays rapidly and reliably may not be the 
most serious hurdle. Ultimately, it is the assay 
itself that will determine the success and relia- 
bility of screening results, and thus compa- 
nies are actively searching for new systems to 
supplement conventional drug activity tests. 
One area in which several companies are 
working is the use of screening approaches 
based on gene expression patterns to find 
novel drug targets. 

According to Richard Shimkets, director 
of internal research at CuraGen (New 
Haven, CT), many of the traditional gene 
profiling approaches (e.g., serial analysis of 
gene expression, differential display, or subtractive
hybridization) are not amenable to
high-throughput applications. What's 
more, it is difficult to gauge the impact of 
current DNA array-based technologies on 
HTS: "I think this is an area that people are 
starting to commit to, but I think they've 
barely scratched the surface of it," says 
Robert Lipshutz, vice president of corporate 
development at Affymetrix (Santa Clara, 
CA), a supplier of expression profiling DNA 
chips. Indeed, Mark Benjamin, senior direc- 
tor of business development at Rosetta 
Inpharmatics (Kirkland, WA), is skeptical 
about the long-term prospects for standard 
DNA arrays in HTS. "The first steps require 
exposing cells and then isolating RNA, 
which is something that's very hard to do 
high-throughput," he says.

Another drawback is that most of the use- 
ful drug targets are likely to be unknown 
(particularly in the agricultural sciences 
where genome sequencing has only just got 
underway), and DNA arrays that are current- 
ly available test only for previously 
sequenced genes. Indeed, some argue that 
current DNA arrays may not be sufficiently
sensitive to detect the low expression levels of
genes encoding targets of particular interest.

To address some of these issues, Shimkets 
and researchers at CuraGen have developed 
a rapid restriction enzyme-based mRNA 
profiling technique (Nat. Biotechnol. 17, 
798-803, 1999), in which cDNA prepared 
from mRNA is digested with individual 
pairs of restriction enzymes, amplified with 
PCR, and the lengths of the resultant frag- 
ments compared with known sequence frag- 
ments lodged in a database. Importantly, the 
presence of unpredicted fragments flags
novel genes, thus allowing differentially 
expressed genes to be cloned, even if they are 
not represented in expressed sequence tag 
(EST) libraries. 
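
The logic of the fragment-matching step described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not CuraGen's method: each enzyme is assumed to cut at the first base of its recognition site, fragment sizing is assumed accurate to within one base, and the function names and the tiny test sequence are hypothetical.

```python
import re
from typing import Dict, List


def fragment_lengths(cdna: str, site_a: str, site_b: str) -> List[int]:
    """Lengths of fragments flanked by one site_a cut and one site_b cut.

    Simplification: both enzymes cut at the first base of their site.
    """
    cuts = [(m.start(), "A") for m in re.finditer(re.escape(site_a), cdna)]
    cuts += [(m.start(), "B") for m in re.finditer(re.escape(site_b), cdna)]
    cuts.sort()
    # Keep only fragments whose two ends come from different enzymes.
    return [p2 - p1 for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]) if e1 != e2]


def flag_unpredicted(observed: List[int],
                     database: Dict[str, List[int]],
                     tolerance: int = 1) -> List[int]:
    """Observed lengths with no database fragment within the tolerance."""
    known = {n for lengths in database.values() for n in lengths}
    return [n for n in observed if all(abs(n - k) > tolerance for k in known)]


# Hypothetical example: predict fragments for one known sequence, then
# compare a set of "measured" lengths against that prediction.
known_cdna = "AAAGAATTCTTTTGGATCCAAAAGAATTCCC"  # EcoRI and BamHI sites
database = {"known_gene": fragment_lengths(known_cdna, "GAATTC", "GGATCC")}
candidates = flag_unpredicted([10, 57], database)
```

Any length left in `candidates` has no counterpart in the database and would, in the scheme the article describes, be a candidate for cloning as a possible novel gene.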

Researchers at Aurora Biosciences are 
using genome-wide, gene-tagging strategies 
to identify novel genes and study expression 
of known genes in cell lines in response to 
drug treatment. These assays employ a promoter-trapping
strategy using a β-lactamase
reporter and fluorescent substrate analogs to
measure promoter activity. The resultant
β-lactamase-tagged cell library is subjected to
fluorescence-activated cell sorting to rapidly
separate clones where the tagged gene is con- 
stitutively expressed, and sequential rounds 
of sorting can be used to identify cell clones 
whose genes are induced or repressed by dif- 
ferent drug candidates. 

Whereas Aurora's approach primarily 
makes use of mammalian cells, other com- 
panies are pursuing yeast genetics to develop 
high-throughput assays. Variants on the
yeast two-hybrid system, which is widely 
used for analyzing protein-protein interactions
in vivo, are now under development at
several pharmaceutical companies to screen 
for small molecules 3,4 . Many of these systems 
attempt to overcome some of the traditional 
problems of two-hybrid systems that make 
them unsuitable for HTS, such as false posi- 
tives and complex assay preparation 3 . 

Other types of functional screens in yeast 
are also under development. For example, 
Cadus Pharmaceuticals (Tarrytown, NY) is 



functionally inserting mammalian G-protein- 
coupled receptors (GPCRs) and associated sig- 
naling proteins into the pheromone response 
pathway of yeast to develop screens for identifying
modulators that act on components of
the GPCR pathway (see p. 878 for details). 

To approximate in vivo systems more 
closely, screens based on the nematode 
Caenorhabditis elegans have also been developed.
The small size (it consists of 959 cells
and is only 1 mm in length), easy culture, and
extensive knowledge regarding C. elegans
biology make the worm readily amenable
for use in HTS. Companies such as Axys 
Pharmaceuticals (S. San Francisco, CA) and 
Exelixis Pharmaceuticals (S. San Francisco, 
CA) currently use automated nematode 
screens to evaluate the organism's anatomy 
and behavior, both before and after genetic 
manipulation, in response to drugs. 

Elsewhere, researchers are designing 
more exotic assays to supplement cellular 
assays. For example, Chad Mirkin of 
Northwestern University recently reported a 
system that exploits the optical properties of 
nanometer-sized gold microspheres. Mirkin 
claims that microspheres derivatized with 
nucleic acids show reproducible and robust 
chromophoric changes on ligand binding 1 . 
Other laboratories are also developing assays 
that allow lipid-soluble proteins (e.g., G- 
protein receptors that are intractable to con- 
ventional assays) to be screened by array on 
isolated membranes. 

Another challenge is to design reliable 
assays to monitor a target gene product's 
expression or activity. In this context, some 
of the technologies lumped under the term 
"proteomics" may prove useful. Microarray 
and chip-based protein binding assays like 
those being developed by Ciphergen (Palo 
Alto, CA) and Large Scale Biology (Rockville, 
MD) may facilitate high-throughput target
identification, whereas the mass spectrome- 
try approaches being pursued by companies 
like Oxford Glycosciences (Cambridge, UK) 
might provide more detailed information on 
individual targets. The preliminary nature of 
most current technology in proteomics, 
however, makes it difficult to predict the 
impact of this field on HTS (Nat. Biotechnol. 
17, 233-236, 1999).

Lastly, a useful assay must be compatible 
with a company's existing automated HTS 
system. In an effort to integrate assay devel- 
opment and execution, some manufactur- 
ers of HTS equipment produce benchtop 
versions of their systems, but the rapid 
development of the field and the number of 
vendors producing equipment make it difficult
to determine whether assay scalability
will become a serious stumbling block in
the future. 

The eve of ADME

If novel targets can be found and new assays 
developed, the next bottleneck in HTS will be 
at the other end of the process: testing the 
new lead compounds to determine their tox- 
icity and metabolism. Activating or inhibit- 
ing a particular gene product in cultured cells 
is a far cry from treating a disease in a whole 
organism, and most assays for pharmacoki- 
netics and toxicology currently rely on 
expensive, time-consuming animal tests. 
Accelerating studies on the absorption, distribution,
metabolism, excretion (ADME) and
toxicology of a drug poses a monumental 
obstacle, but one that some HTS companies 
are hoping to surmount. 

"Most lead optimization is done the 
same way primary screening was done 10 
years ago: it's bench top biology," says
Lansing Taylor, CEO of Cellomics 
(Pittsburgh, PA). As automated HTS churns 
out increasing numbers of lead compounds, 
ADME researchers are finding themselves 
overwhelmed. In collaboration with the 
optics manufacturer Carl Zeiss 
(Thornwood, NY), Cellomics has developed 
automated assay systems that provide a 
detailed view of the behavior of gene products
in whole cells under different conditions,
an approach that Taylor refers to as
"high-content screening [or HCS]."

Table 1. Technologies and customers of some

The idea is to automate the experiments
that would be the focus of early-stage ADME
and toxicology studies, and to weed out
compounds with unsuitable behaviors in whole
cells. Rather than replacing existing assays, the 
Cellomics technology is designed to be incor- 
porated into existing HTS systems. After a 
"first round" high-throughput screen to iden- 
tify compounds with a desired activity, 
flagged compounds enter a round of more 
complicated tests for toxicity. This second 
round is referred to as "high-content screening"
or HCS because of the amount of additional
information it provides. "If you have . . .
a 1,536-well plate, and a 0.5% hit statistic,
that means you're going to have seven or eight 
hits per plate. That plate is now transferred 
into the HCS reader, which now only has to 
read seven or eight wells," Taylor explains. 
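
Taylor's "seven or eight hits per plate" is simply plate size multiplied by hit rate (1,536 × 0.005 = 7.68). The Python sketch below repeats that arithmetic for the three plate formats mentioned in this article and also estimates how many plates a 100,000-compound run would fill. It assumes one compound per well and no control wells, which real screens would not; the function names are ours.

```python
import math


def expected_hits_per_plate(wells: int, hit_rate: float) -> float:
    """Mean number of hits on a full plate at a given hit rate."""
    return wells * hit_rate


def plates_needed(n_compounds: int, wells: int) -> int:
    """Plates required if every well holds one compound (no controls)."""
    return math.ceil(n_compounds / wells)


for wells in (96, 384, 1536):
    hits = expected_hits_per_plate(wells, 0.005)
    print(wells, round(hits, 2), plates_needed(100_000, wells))
```

At the lower end of the 0.1-0.5% hit-rate range quoted earlier, the same 1,536-well plate would carry only one or two hits, which is why a second-round reader that visits flagged wells only has so little to do.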

DNA microarrays have also been touted 
as a tool for toxicological studies, since toxic- 
ity generally involves changes in gene expres- 
sion patterns, but the arrays used for general- 
purpose screening will probably be unsuit- 
able. Spencer Farr, CEO of Phase I Molecular 
Toxicology (Santa Fe, NM), cautions that 



"you certainly don't want to walk into the 
FDA and say, 'With this compound, we saw 
the following 27 genes turned on 30-fold, but 
we have no idea what they are!'" Instead, a 
few HTS companies are trying to develop 
arrays of surrogate markers that are known 
to be activated by particular types of toxins. 

Affymetrix now produces a dedicated 
toxicology array based on a rat model, and 
Phase I is pursuing several approaches to 
high-throughput toxicology. "We try and
put together assays where we glean more
than just the old 'when does it die?' kind of
information. More importantly, why is it 
sick?" says Farr. In addition to surrogate 
marker arrays, Phase I has also developed 
hypersensitive cell lines that lack crucial 
repair machinery. 

Whether the systems are designed for 
ADME or toxicological assays, the guiding 
principle is to "fail fast and fail cheap," eliminating
unpromising lead compounds from
consideration as quickly as possible; however,
the preliminary nature of the work in this
area suggests that lead optimization is a long 
way from achieving this goal at present. 

Deconvoluting the data 

Although HTS companies differ with regard 
to the relative significance of the bottlenecks 
in ADME, assay development, or hardware
integration, there is widespread agreement 
that the biggest obstacle in HTS is collating, 
deconvoluting, analyzing, sorting, and stor- 
ing the information derived from the assays 
(i.e., bioinformatics). 

Although it is referred to as a single 
problem, bioinformatics is likely to create 
several distinct obstacles to the develop- 
ment of HTS. Mark Benjamin at Rosetta 
explains that the first hurdle is the decep- 
tively simple problem of information stor- 
age and retrieval: "Everyone who is using 
more than a couple of microarrays a day is 
saying 'I can't store all this stuff in
Microsoft Excel.'" Microarrays, which may
produce in excess of 100,000 data points for 
a single experiment, are not the only source 
of trouble. Even "simple" HTS programs 
are often testing combinatorial libraries of 
more than a million compounds, easily 
overwhelming most standard database
software. 
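
Benjamin's storage complaint is easy to appreciate: spreadsheet programs of the period capped a worksheet at 65,536 rows, fewer than the 100,000 data points a single microarray experiment can produce. As a hedged illustration of the kind of store that copes with this volume, the sketch below loads 100,000 synthetic intensities into an in-memory SQLite database using Python's standard library. The table layout and names are hypothetical and do not describe any vendor's product.

```python
import random
import sqlite3
from typing import List, Tuple


def build_store(n_points: int = 100_000, seed: int = 0) -> sqlite3.Connection:
    """Create an in-memory table holding one experiment's intensities."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement ("
        " experiment_id INTEGER NOT NULL,"
        " probe_id INTEGER NOT NULL,"
        " intensity REAL NOT NULL,"
        " PRIMARY KEY (experiment_id, probe_id))"
    )
    # Synthetic, log-normally distributed signal; real data would be loaded
    # from the array reader's output files instead.
    rows = ((1, probe, rng.lognormvariate(5.0, 1.0)) for probe in range(n_points))
    conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def top_probes(conn: sqlite3.Connection, experiment_id: int,
               limit: int = 5) -> List[Tuple[int, float]]:
    """Return the brightest probes for one experiment."""
    cur = conn.execute(
        "SELECT probe_id, intensity FROM measurement"
        " WHERE experiment_id = ? ORDER BY intensity DESC LIMIT ?",
        (experiment_id, limit),
    )
    return cur.fetchall()


store = build_store()
brightest = top_probes(store, experiment_id=1)
```

Storage and retrieval of this kind is only the first hurdle the article identifies; it does nothing to interpret the data.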

One solution is for the HTS companies 
themselves to develop the informatics tools 
their customers will need. CuraGen, which 
carries out HTS for hire, analyzes a customer's
samples with its proprietary screening
technology, then provides the results and
database services electronically. "Everyone 
accesses this information through the same 
Internet connection," says Shimkets. 

The future of many HTS companies may 
well lie in subcontracting for pharmaceuti- 
cal companies, but many contend that large 
companies will want to keep their HTS 
activities to themselves (see "Do-it-yourself 
HTS"). For in-house programs, companies 
like Rosetta Inpharmatics are hoping to 
provide bioinformatics tools and products 
that customers can use as they see fit. 
Rosetta recently acquired Acacia
Biosciences (Richmond, CA), adding that 
company's yeast bioassay to a technology 
portfolio that already included DNA arrays. 
Mark Benjamin asserts that Rosetta's bioin- 
formatics products are designed to facilitate 
the use of its arrays, whether the arrays are 
used for HTS, early stages of assay develop- 
ment, or basic research. Rosetta also plans 
to market its software as a stand-alone 
product for use with other brands of
microarrays.

One of the other brands that benefits from 
separate software is Affymetrix. In contrast to 
Rosetta's approach, Affymetrix has focused on
producing chip hardware, which is sold with 
software that converts results from the chip 



into a standard data format called GATC 
(Genetic Analysis Technology Consortium). 
Since the standard has been made public, 
other companies are free to write software to 
analyze Affymetrix chip output. 

Competition and specialization may be 
necessary to address the next bioinformatics 
hurdle in HTS. "I'm collecting all of these 
data, [but] I'm not really interested in stor- 
ing and retrieving and analyzing. What I 
[really] want to know is how can I use this to 
actually design a better drug?" says 
Benjamin. Affymetrix's Lipshutz agrees, 
adding that the capabilities of microarray 
hardware are now being restricted by bioin- 
formatics problems: "It's not the assay per se 
that's useful, it's basically understanding 
how to interpret the data biologically." 

Growing pains 

Ultimately, the test of HTS will be whether 
the technology can live up to its hype. "The 
worry we have is that HTS is not only 
important in the commercial sense, but it's 
important to be seen working in the area," 
says Archer. As a result, some pharmaceuti- 
cal companies may be pouring resources 
into HTS programs without carefully con- 
sidering the limits of the approach and the 
serious obstacles it still faces. "Wall Street is 
going to be asking not what machines have 
you put into your facility, but how many 
hits per day are you generating in reality. If 
[HTS] fails to deliver. . .it will fall into disre- 
pute for reasons that have nothing to do 
with the technology," says Archer. 

Indeed, a problem that has remained 
largely unaddressed by current HTS efforts 
lies outside the technology, in the infra- 
structure and attitudes of traditional phar- 
maceutical R&D programs. The model of 
academic research, in which small groups 
collaborate on different aspects of a problem
while retaining some independence
from each other, may be a recipe for disas- 
ter when applied to HTS. If an assay devel- 
opment team is not intimately familiar 
with the limitations of automated screen- 
ing equipment, new assays may force the 
screening group to retool a roomful of 
equipment. Similarly, a library of com- 
pounds built around a common backbone 
that is poorly absorbed in vivo may pro- 
duce hundreds of hits in a preliminary 
screen, only to have them eliminated in 
late-stage ADME tests. 

A Review of Automation Options to Support Plate Preparation,
Cherry Picking, and Homogeneous Assays

ABSTRACT

Developments in high throughput screening (HTS) have led to new needs in automation to enable better handling 
of applications such as homogeneous assays and cherry picking. Software and hardware integration approaches for 
screening automation have been changing in concert with these new application needs. The result of this combina- 
tion has been the production of robotic systems for drug discovery with improved stability and functionality. This 
review critically assesses some Zymark, Tecan, and Beckman solutions for current HTS requirements. 




INTRODUCTION 

In the last four years, robotic high throughput screening
(HTS) for drug discovery has moved from being a desire of
most companies to a reality.1,2 During this period, robotics and
their control software have evolved from systems which were
[...] generation [...] and flexibility
(e.g., Zymark Virtuoso™, Hopkinton, MA; Beckman Core
Systems™, Fullerton, CA; Tecan Genesis™, Switzerland).
However, there is further room for improvement in primary
screening robotic systems. Promise for improvements may
come from newly developed ultra-high throughput screening
(UHTS) technologies, e.g., Aurora (San Diego, CA), Pharmacopeia
(Princeton, NJ) and such products as the Zymark Allegro™
(Hopkinton, MA) which take a more industrialized
"assembly-line" approach to microplate processing.3,4,5 Reducing
screening cost and increasing throughput are the primary driv- 
ers for novel HTS technologies and more robust industrialized 
robotics. 

The successful implementation of primary screening in re- 
cent years has resulted in new needs for automation, partic- 
ularly for secondary screening. These needs stem from the 
nature of secondary screening, which generally requires more 



elaborate microplate preparation procedures. Additionally, 
secondary screening is greatly facilitated by automating compound
"re-array" based on the discovery of hits in primary
screening (also called "cherry picking"). Challenges in areas 
of plate preparation, daughter plate creation and "cherry picking"
include software and hardware flexibility, pipetting ac-
[...]
tems that perform these tasks are also capable of performing
homogeneous assays. The purpose of this article is to review
the current commercial options for these tasks based on
needs, features, and cost. The systems we will cover include
the Zymark Virtuoso™ (Hopkinton, MA), Tecan Genesis 
configurations, Sagian Core Systems™, and Sagian Core 
Generations™ (Fullerton, CA). 



GENERAL SYSTEM OVERVIEW 

Zymark Virtuoso™ (Figs. 1 and 2)

This new product from Zymark is a departure from the com- 
pany's past approach of custom systems. These systems can 
be configured at the time of purchase based on a selection of 
available modules/devices. A common configuration which
pharmaceutical companies have begun purchasing for sample
preparation, mother-daughter plate creation and "cherry picking"
includes the following hardware: a 176-plate storage
carousel (temperature control is an option); a Rapidplate®
96/384 channel pipetter; a Presto™ liquid handler with both
an eight and a single channel pipetter (a Cavro [Sunnyvale,
CA] OEM product with the option of either Teflon cannulae
or removable pipet tips); and an EG&G Wallac (Gaithersburg,
MD) Victor™ plate reader. The Rapidplate may be configured
to dispose of pipette tips after each use, or wash and reuse
these tips. Microplates are transported by a track mounted
Zymate® XP™ robot. Other configurations are certainly possible
based on the purchaser's module selection from a list of Zymark
standardized choices.

1 HTS Consulting Ltd., 2238 Adrian St., Thousand Oaks, CA 91320.
2 Source Biopharmaceuticals Inc., Boulder, CO 80304.

FIG. 1. Zymark Virtuoso 1. This standard product by Zymark is best used for "cherry picking" and homogeneous assay tasks. This
photographed system does not have the optional Rapidplate 96/384 installed. The normal position of the Rapidplate 96/384 is shown
in the diagrammatic representation of Virtuoso 1 (Fig. 2). (Photograph courtesy of Zymark.)

FIG. 2. Diagrammatic representation of the Zymark Virtuoso 1. All possible modules for a Virtuoso 1 configuration are shown.

Tecan Genesis™ 

The Genesis has been put to many uses including combina- 
torial chemistry, plate preparation, and screening. Some rea- 
sons for its success include its pipetting accuracy, choice of 
deck size ranging from 1 to 2 meters, eight independently ad- 
dressable X, Y, Z cannulae (Teflon-coated washable cannulae 
or pipette-tip attaching cannulae), and the ROMA™ gripper tool 
for plate transport. Recently, Tecan has offered the ability to 
customize the Genesis by interfacing plate processing equip- 
ment. These options include a carousel (210 microplates), 96- 
channel pipetter (to date only the Matrix [Lowell, MA] Plate- 
Mate™ and Tomtec [Hamden, CT] QUADRA 96™ pipetters 
have been integrated), a plate washer, and plate readers. These 
options make the Genesis capable of plate preparation, "cherry
picking" and assays. 

Sagian Core Generations™

These small and simple integrated systems come standard
as a 1 meter ORCA™ robot servicing a carousel and a
Multimek™ (96-well pipetter), together with bar code reading ca-
[...] microplates or deepwell plates. Beckman (via their Sagian
division) can add other modules, such as plate sealers and
print/apply bar code labelers, making the system a custom 2
or 3 meter rail system purchase rather than the Core Generations
product. The Generations system is driven via an easy to
use icon-based GUI written specifically for the Generations
system. Setup and configuration of the Multimek is accomplished
via that device's own GUI. The Multimek retains its
front panel interface for manual control. Liquid transfer with
the Multimek is pipette-tip based for transfers of 1-200 µl.
These tips can be discarded after each use or washed and
reused. A fixed, washable tip option is available for transferring
volumes of 1-20 µl.

Sagian Core Systems™ (Fig. 3)

This type of system differs from those above in both size and
complexity. These systems can complete microplate preparation
and cherry picking tasks. In addition, functionality extends
to the automation of complex assays. Sagian Core Systems include
such items as a Biomek™ 2000 pipetter, a Multimek 96-channel
pipetter, a plate storage carousel, a CO2 incubator, readers,
print & apply bar-coding, and filtration, as well as other
devices. A central, track-based ORCA robot transports microplates
to/from the various workstations (tracks can be one
to three meters in length). These systems come standard with
Sagian's SAMI™ icon-based scheduler, which supports optimized
scheduling.



SYSTEM STRENGTHS AND WEAKNESSES

Zymark Virtuoso™

Given the potential functionality of Virtuoso, the system
has a low price compared to other competing products. Perhaps
the strongest feature of this system is the pipetting
combination of the Rapidplate 96/384 and the Presto liquid
handler. We have found the Rapidplate 96/384 to be the
most accurate and reliable fully automated, robot compatible
96/384 channel pipetter on the market for liquid transfers in
the 2-150 µl range (1 µl with a 6% CV is possible in multi-aspirate
mode). It is also among the least expensive. The
Presto Liquid handler offers the reliability of Cavro pipetting
devices with a very small footprint and low cost. Based on
our experiences, the combination of these Zymark liquid handlers
outperforms the Beckman Multimek/Biomek combination
in cost, unit size, and pipetting ability. The Virtuoso system
has its own table for support and cannot be placed onto
a laboratory bench.

The Virtuoso is homogeneous assay capable, which separates
it from the other simpler systems in the same cost bracket. The
Presto liquid handler possesses the Cartesian positional accuracy
to address 384-well plates. Microplate "cherry picking" is
done via the Presto single-channel pipetting arm, which can be
slow if there are a high number of "hits" to be picked from
[...] 384-well compatibility is very important as many companies
are attempting to transition a substantial percentage of their assays
to this format in 1998.
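
The pipetting precision quoted above for the Rapidplate (a 6% CV) is a coefficient of variation: the standard deviation of replicate dispenses expressed as a percentage of their mean. A minimal Python sketch of that calculation follows; the replicate volumes are invented for illustration and are not measurements from any of the instruments reviewed here.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(volumes_ul: Sequence[float]) -> float:
    """Percent CV of replicate dispensed volumes (sample standard deviation)."""
    mean = statistics.mean(volumes_ul)
    return 100.0 * statistics.stdev(volumes_ul) / mean


# Hypothetical replicate dispenses at a nominal 1 µl.
replicates = [0.94, 1.00, 1.06]
cv_percent = coefficient_of_variation(replicates)
```

With these three invented values the CV works out to roughly 6%, the figure the review reports as achievable at 1 µl.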

Zymark's current method building and scheduling software
for Virtuoso is PCS™ 3.0 running on Windows® 95 or NT™.
Although this software lacks some of the power of Beckman's
SAMI NT/SILAS™ control software, it is simpler to use and 
learn. Zymark offers as an option "cherry picking" software, 
which supports the use of both the Rapidplate 96/384 and 
Presto™ liquid handler to create "hit" plates or daughter plates. 
For both PCS and the "cherry picking" software, links to com- 
mon databases and data storage methods are provided. The new 
"cherry picking" software is easy to use and very visual in its 
design. 

Tecan Genesis™ (FACTS) 

Eight independently addressable cannulae give Tecan Gen- 
esis™ systems the best "cherry picking" 96-well liquid han- 
dling mechanics of all devices on the market. This feature al- 
lows up to eight "picks" to be done from a single plate before 
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FIG. 3. Beckman-SAGIAN Core System. Visible is the ORCA robotic arm, Biomek 2000 a microplate washer and microplate stor- 
age hotel. (Photograph courtesy of Beckman-Coulter.) 



tip washing is needed. However, maintaining the positional accuracy
of eight independent long cannulae may make 384-well
addressing a maintenance issue. The Genesis offers the option
of either disposable pipette tips or fixed washable cannulae for
liquid [text illegible in source] washable tips is
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and time savings. The standard Genesis systems lack compo- 
nents which are essential to streamlined "cherry picking" 
processes. By the very nature of "cherry picking", one has to 
access many mother plates to create a few daughter plates; 
therefore, a high-capacity microplate storage device is an im- 
portant module. Additionally, a 96 channel pipetter is also a 
useful feature for "cherry picking" systems to enable rapid mi- 
croplate compound dilutions in a single action. The same com- 
ponents are also very important in an efficient system for plate 
dilutions and standard mother-daughter plate creation. Once a 
carousel and a 96-channel pipetter are added, the Genesis 
FACTS has a high cost relative to the Zymark Virtuoso. At the 
time of this review, the addition of both a carousel and a 96- 
channel pipetter is not a standard product for Tecan. The com- 
pany's experience to date covers only the Matrix PlateMate and 
Tomtec QUADRA 96 channel pipetters, which could poten- 
tially be coupled to a 210 microplate carousel. 
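
The point about storage capacity can be made concrete with a small work-list sketch. The Python below is illustrative only and is not taken from any vendor software: the function name `plan_cherry_picks`, the hit-list format and the plate identifiers are hypothetical, and the limit of eight picks before a tip wash is simply borrowed from the eight-cannula description above.

```python
from collections import defaultdict


def plan_cherry_picks(hits, picks_per_wash=8):
    """Group hit wells by mother plate and split them into pick batches.

    hits: iterable of (mother_plate_id, well) tuples.
    picks_per_wash: picks taken from one plate before the tips are washed
                    (assumed here to be 8, one per cannula).
    Returns a list of (mother_plate_id, [wells]) batches in plate order.
    """
    by_plate = defaultdict(list)
    for plate, well in hits:
        by_plate[plate].append(well)

    batches = []
    for plate in sorted(by_plate):
        wells = sorted(by_plate[plate])
        # Split the wells of one mother plate into chunks of at most picks_per_wash.
        for start in range(0, len(wells), picks_per_wash):
            batches.append((plate, wells[start:start + picks_per_wash]))
    return batches


# Hypothetical hit list: four hits spread over three mother plates.
hits = [("M001", "A1"), ("M017", "C5"), ("M001", "H12"), ("M042", "B7")]
batches = plan_cherry_picks(hits)
mother_plates_needed = len({plate for plate, _ in batches})
```

Counting the distinct mother plates in the resulting batches gives a rough idea of how many storage positions a single daughter plate can require, which is why a high-capacity carousel matters for this task.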

The Tecan Genesis can be configured for screening work 
with combinations of a plate reader (Tecan Spectra™ series) 
and washer (Tecan 96 PW) and the modules mentioned above. 
While this is feasible, Tecan has not yet coupled that number 
of larger modules to one Genesis. Tecan provides software 



which has good functionality (method builder and a dynamic 
scheduler), although it is somewhat disjointed (e.g., the pipet- 
ting method builder and scheduler do not exist under the same 
shell, or ".exe" program). As with Zymark's Virtuoso, Tecan

[text illegible in source]

easily fits onto a standard lab benchtop. However, the addition
of a carousel and 96-channel pipetting device makes the overall
combination rather large and cumbersome.

Sagian Core Generations™ 

These systems are generally small and therefore low in com- 
plexity and price. Consequently, functionality is limited. This 
product has been used within companies for mother-daughter 
plate creation. Such a system configuration includes a carousel, 
bar code reader and Multimek 96-channel pipetter. Utilizing the 
ORCA arm, this system occupies a very efficient footprint that 
could reside on a 36" lab benchtop. The standard systems are 
not adaptable to "cherry picking" which would require a Bio- 
mek 2000™ (generally only placed on larger Sagian Core sys- 
tems). Additionally, Core Generations systems are not assay 
capable, in that an interface to a plate reader or reagent addition
system is not part of the Generations package.

In general, we have found that the Multimek is slower at 96- 
channel pipetting operations than the Rapidplate 96/384. This 
difference is marked when not reusing/washing pipette tips, as 
the Multimek tip attaching actions are slow. Exact comparisons 
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of pipetting accuracy depend on the range of transfer involved 
and the exact nature of the liquid. Generally, we have had fewer
problems with transfer inaccuracy with the Zymark Rapidplate
96/384.

Because of the nature of the Core Generations system software,
changes in the hardware configuration are not easily accomplished.
For example, the current GUI does not allow the
user to compensate for changing racks in the storage carousel
between microplate and deepwell types. Instead, these changes
have to be made in a carousel configuration file. This type of
file manipulation can easily lead to errors in method execution,
so extra care must be taken when making and saving changes.
Beckman/Sagian will customize their standard GUI interface to
meet specific customer needs. While this custom interface does
not include a scheduler (Sagian provides their icon-based SAMI
scheduler for their more complex Core systems), this is not a
limitation due to the focused task nature of these systems.

Sagian Core Systems™

These systems provide a wide range of potential functional- 
ity and features which are best suited for the automation of 
more complex procedures, such as heterogeneous screening assays.
The SAMI NT/SILAS™ software architecture offers a
highly flexible, powerful, but costly package for system sched- 
uling, control and interfacing. The ORCA and Biomek 2000 
mechanics allow space efficient layouts. However, we have 
found that smaller and simpler automated systems are more 
suitable for focused, specific tasks. Therefore, for support of 
secondary screening, for tasks such as plate preparation, and 
for "cherry picking", the complexity and cost of the larger 
Beckman systems is difficult to justify with simpler lower cost 
solutions available. 



SUMMARY 



[text illegible in source]
This selection is both a function of vendor options and the nature
of the task to be automated. Additionally, not every screening-related
activity should be automated. Recently, vendors
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have switched to simpler approaches for automation solutions, 
stressing standardization rather than customization. The recent 
convergence of homogeneous assays, robotics designed for ho- 
mogeneous protocols, and logical plate preparation paradigms 
will hopefully make the coming years of screening robotics a
far more productive investment than previous years. Judging
an automation supplier by performance in a previous project of
a specific nature may not be the best strategy for selection of
a new technology for a different application. Each of the vendors
(e.g., Zymark, Beckman-Sagian, Tecan) has particular
screening automation niches in which it is strong. For
example, this manuscript describes Zymark as being particu- 
larly strong for homogeneous and cherry picking applications. 
In contrast, Beckman can be viewed as stronger in the arena of 
heterogeneous cell based assays. Furthermore, technology se- 
lection for the same task will vary among purchasers based on 
differing priorities in throughput, reliability, and cost. 
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§ Robotic equipment and microsystem 
technology in biological research 

Holger Eickhoff, Igor Ivanov, Markus Kietzmann, Elmar Maier,
Markus Kalkum, David Bancroft and Hans Lehrach



2.1 Introduction 

Each cell of a living organism contains the whole genetic information in the
form of DNA molecules. The size of the DNA from a single cell, or genome, of
human beings is 3 x 10^9 nucleotide base pairs. Although the DNA information
is usually identical in each cell there are several hundred different cell types.
This is due to the fact that genetic information is read out from genes and transcribed
from DNA into a cell-specific population of mRNA molecules, which
itself can be further translated into different types of proteins. Every step of
these cellular processes includes complex interactions of DNA, RNA and protein
(Alberts, Bray et al., 1994). To understand these interaction mechanisms
scientists started to decode the genetic information (Dulbecco, 1986). This task
finally became a major goal of the Human Genome Project (Cantor, 1990).

The benefits of this project are visible already before one-tenth of the
genome sequence has been revealed. Genomic databases have enabled
scientists to access, retrieve and process biological information (Zehetner and
Lehrach, 1994). At the same time, the Human Genome Project has changed
the attitude and direction of biological research (Tilghman, 1996). Currently,
the interest of researchers is focused on finding genes, analysing their expression
patterns and their in vivo functions as well as further features of the
corresponding proteins. The order and the expression profile of biological
information is another level of complexity even more important for the understanding
of organisms. Genes whose expression is highly specific to a tissue,
organ, cell type or disease may be attractive as targets for the development of
highly specific therapeutics and diagnostics (Maier, Meier-Ewert et al., 1997).

Since there are approximately 100000 genes predicted for the human 
genome, new methods and reliable techniques for processing many samples in 
parallel and at high throughput are needed. 
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Here we describe how automated robotic systems can facilitate biological 
research. Robots have been developed mainly for the parallel analysis
and the characterization of large DNA arrays (Meier-Ewert, Maier et al., 1993,
1994). These automated techniques allow the examination of tens of
thousands of clones in parallel by hybridisation-based approaches. We also
show how to implement the principles of these robotic systems for other biological
tasks including protein analysis and the characterisation of gene
expression.



2.2 Hybridisation based approaches to genome analysis 

Most of the methods for DNA characterisation are based on the fundamental 
fact that DNA is able to form a full or a partially complementary double 
helix hybrid from two separate single stranded nucleotide chains (Watson 
and Crick, 1953). Hybridisation is the interaction of two DNA strands. To
detect hybridisation events one strand (target) is usually immobilised on a 
solid support, e.g. nylon membranes, whereas its counterpart (probe) is 
fished out by the target from a hybridisation solution. The probe is labelled 
and the hybridisation is detected by measuring the signal on the solid sur- 
face in the region of immobilised target (Wetmur, 1991; Meinkoth and Wahl, 
1984). 

Hybridisation approaches are important tools for large scale DNA charac- 
terisation and require, among other things, upstream clone picking and 
spotting, the probe hybridisation itself, and downstream image and computer 
analysis (Lehrach, Bancroft et al., 1997). First of all, a pool of DNA molecules
[text illegible in source] randomly
spread colonies of bacteria are grown on agar plates. Each colony
carries a unique DNA fragment or clone. Since bacteria can carry only a relatively
short DNA fragment, a large number of clones, or a clone library, is
needed for a full coverage of a genome or even a tissue-specific cDNA library.
A typical size of a cDNA library is a hundred thousand clones. After picking,
selected clones can be grown and kept in microtitre plates. This allows long-term
storage, analysis and subsequent retrieval of individual clones. Clones
from microtitre plates can be used for DNA amplification by PCR or they can
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be arrayed on nylon membranes for subsequent hybridisation with specific 
probes (Meier-Ewert, Maier et al., 1993). 



Automated clone picking 

The pattern of colonies randomly grown on agar plates is checked by an image
analysis system to determine the position of the colonies for picking. The randomly
grouped, proportioned and shaped bacterial colonies are automatically
selected on the basis of given criteria: colour, shape, size. The image analysis
software is able to recognise clones as small as 0.5 mm in diameter and
select for blue/white genetic systems in E. coli. After defining the colonies'
position, software translates coordinates into robot movement for picking.
One pin of the picking head (Fig. 1) touches the colony. Then the 96 pins of
the picking head transfer and inoculate colonies in a microtitre plate for
growth and storage.
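
The selection step can be sketched in a few lines of Python. This is a hedged illustration, not the authors' software: the `Colony` record, the `select_targets` function, the circularity measure (4πA/P², which equals 1 for a perfect circle), the 0.8 shape threshold and the simple pixel-to-millimetre scale factor are all assumptions. Only the 0.5 mm minimum diameter and the colour, shape and size criteria come from the text, and choosing white colonies is one common convention for blue/white screening.

```python
import math
from dataclasses import dataclass


@dataclass
class Colony:
    x_px: float          # centroid column in the camera image
    y_px: float          # centroid row in the camera image
    area_px: float       # colony area in pixels
    perimeter_px: float  # outline length in pixels
    is_white: bool       # True for white colonies in a blue/white screen


def select_targets(colonies, mm_per_px, min_diameter_mm=0.5, min_circularity=0.8):
    """Return (x_mm, y_mm) picking targets for colonies that pass all filters."""
    targets = []
    for c in colonies:
        # Size: diameter of a circle with the same area as the colony.
        diameter_mm = 2.0 * math.sqrt(c.area_px / math.pi) * mm_per_px
        # Shape: 1.0 for a circle, smaller for irregular or merged colonies.
        circularity = 4.0 * math.pi * c.area_px / c.perimeter_px ** 2
        if c.is_white and diameter_mm >= min_diameter_mm and circularity >= min_circularity:
            # Assumed calibration: image and robot axes aligned, pure scaling.
            targets.append((c.x_px * mm_per_px, c.y_px * mm_per_px))
    return targets


colonies = [
    Colony(100, 200, math.pi * 5 ** 2, 2 * math.pi * 5, True),   # round, 1.0 mm, white
    Colony(300, 50, math.pi * 2 ** 2, 2 * math.pi * 2, True),    # round but only 0.4 mm
    Colony(400, 400, math.pi * 5 ** 2, 2 * math.pi * 5, False),  # blue colony
    Colony(150, 350, math.pi * 5 ** 2, 60.0, True),              # irregular outline
]
targets = select_targets(colonies, mm_per_px=0.1)
```

In this made-up example only the first colony passes all three filters. A real system would also need a calibrated transform (offset and rotation) between camera and robot coordinates, which the sketch leaves out.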

We have integrated the picking feature in a flat-bed robot that is capable
of picking and spotting. In the past 8 years we designed and tested several




Figure 1 Picking Head: The picking head consists of three major parts: a CCD camera for taking a picture 
of an agar colony tray (left side), an x, y moving table that guides the pressure line to the chosen pins and the 
picking tool with 96 pins 
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generations of clone picking systems. The system is capable to pick and 
inoculate approximately 3000 clones per hour into 384-well microtitre plates
(Maier, 1995).

Robotics systems for automated clone arraying 

After picking and growing thousands of colonies in microtitre plates, the colonies
are arrayed with a 384-pin head (Fig. 2) onto nylon membranes. The
spotting head is moved on a servo-controlled, three-axis, linear drive system
with an accuracy of 25 µm. A complete spotting run includes the handling of
up to 72 microtitre plates, bar-code reading, lid lifting, 384 parallel clone
transfers, pin sterilisation and pin drying. The volume of liquid transferred
with a pin depends on the tip diameter, which varies from 150 µm to 450 µm,
corresponding to 5 nl to 50 nl of liquid. The smaller the pin dia-
meter, the higher the spotting density that can be achieved. For routine ope- 
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rations 27648 clones are spotted in a duplicate pattern as a 5 x 5 box format 
around 2304 guide dots per 22 cm x 22 cm nylon membrane. The duplicate 
spotting simplifies detection and identification of positive clones after each 
hybridisation. The nylon membranes have been reused at least 20 times without
significant loss of hybridisation information. With the system described
here it is possible to immobilise and analyse up to 147456 clones on a single 
22 cm x 22 cm nylon surface. 
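
The spotting numbers quoted above are mutually consistent, as a short arithmetic check shows. The sketch below uses only figures from the text, plus two assumptions that the text does not state: that the guide dot occupies one of the 25 positions in each 5 x 5 block, and that the blocks are arranged in a square grid.

```python
import math

PLATES_PER_RUN = 72      # "up to 72 microtitre plates" per spotting run
WELLS_PER_PLATE = 384
GUIDE_DOTS = 2304        # one guide dot per 5 x 5 block
BLOCK_SIDE = 5
REPLICATES = 2           # duplicate spotting

clones = PLATES_PER_RUN * WELLS_PER_PLATE

# Assumption: one of the 25 block positions is taken by the guide dot.
clone_positions_per_block = BLOCK_SIDE * BLOCK_SIDE - 1
clone_spots = GUIDE_DOTS * clone_positions_per_block
clones_per_block = clone_positions_per_block // REPLICATES

# Assumption: the blocks form a square grid on the square membrane.
blocks_per_side = math.isqrt(GUIDE_DOTS)
positions_per_side = blocks_per_side * BLOCK_SIDE

print(clones, clone_spots, clones_per_block, blocks_per_side, positions_per_side)
```

Under these assumptions, 72 plates of 384 wells give the 27648 clones quoted, and 2304 blocks with 24 free positions each give exactly twice that number of spots, which matches the duplicate pattern.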




Large scale thermocycling 

In addition to growing arrayed colonies on membranes, suitable DNA 
amounts can be generated by DNA amplification from colonies, e.g. by the 
polymerase chain reaction (PCR). PCR techniques (Saiki, Scharf et al., 1985; 
Saiki, Bugawan et al., 1986) have been developed for a long time and they play 
a central role in large scale genome analysis programs. Commercially avail- 
able cycling devices can handle up to four times 384 probes in parallel. We 
have built a laboratory thermocycling prototype for high throughput DNA 
amplification based on large water baths (Maier, 1995). A basket filled with 
135 different 384-well microtitre plates (51840 reactions) is moved with a 
pneumatically driven x/z sliding stage between three 220-liter water baths at 
three eligible temperatures. The microtitre plates are heat sealed with a 
plastic foil in a commercially available heat sealer to prevent cross contami- 
nation. After the amplification step the DNA product is sufficiently pure for 
spotting (Meier-Ewert, Maier et al., 1993).
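
To put the capacity figures side by side, the following sketch compares the water-bath prototype with a commercial cycler handling four 384-well plates. The library size of 100000 clones is the typical cDNA library size quoted earlier in this chapter; the variable names and the idea of counting runs per library are illustrative assumptions.

```python
import math

WELLS_PER_PLATE = 384
LIBRARY_SIZE = 100_000   # typical cDNA library size quoted earlier in the chapter

water_bath_reactions = 135 * WELLS_PER_PLATE   # basket of 135 plates
commercial_reactions = 4 * WELLS_PER_PLATE     # "up to four times 384" in parallel

# Number of complete cycling runs needed to amplify the whole library once.
runs_water_bath = math.ceil(LIBRARY_SIZE / water_bath_reactions)
runs_commercial = math.ceil(LIBRARY_SIZE / commercial_reactions)

print(water_bath_reactions, commercial_reactions, runs_water_bath, runs_commercial)
```

The comparison ignores run duration and plate handling time, so it indicates relative capacity only.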
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The high throughput experiments based on hybridisation require ideally non- 
radioactive detection methods. At present only a few articles have been
published that describe the use of directly labelled fluorescent probes for
hybridisation with DNA on solid supports. This is due to the low signal to
noise ratio that can be obtained with directly labelled probes. We use a hybridisation
protocol which utilises an enzymatic signal amplification (Maier,
Crollius et al., 1994). The DNA probe has a tag, e.g. digoxigenin, which is
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recognised by an antibody conjugated with alkaline phosphatase (anti-dig- 
oxigenin-AP Fab fragment, Boehringer-Mannheim). The hybridisation is
visualised with a non-fluorescent substrate for alkaline phosphatase called
Attophos. This substrate becomes highly fluorescent after a phosphate group
in the Attophos molecule is removed. The detection is quite sensitive since
each active centre of alkaline phosphatase can process about 10^4 to 10^5 substrate
molecules per minute (Cherry, Young et al., 1994). This labelling
method is reliable for a wide range of hybridisation probes, ranging from
short oligonucleotides to long PCR products. For documentation and analysis
of hybridisation the positive signals are detected by excitation with UV light
(365 nm) and photographed with a high resolution CCD camera (Photometrics
PXL, KAF 1400 chip) through a fluorescence emission filter (589 nm
bandpass filter, bandwidth 80 nm, Herolab, Germany). The pictures are digitised
into a Macintosh PowerPC 8100. The obtained spatial resolution is
drastically increased in other detection systems that use Time Delay Integration
Linescan cameras or laser scanning principles.



Automated image analysis 



An important feature in high throughput laboratories is the automated, large
scale characterisation of positive clones in the investigated libraries. The main
requirements for such an automated analysis system are firstly the automatic
grid finding on an array and secondly the determination of positive clones.
Unfortunately, different hybridisation arrays show different qualities. The
quality of the picture can be affected by several factors, e.g. uneven distribution
of spots on a picture due to non flatness of nylon membranes or a high



decision of whether a clone is positive or not. Nevertheless, the algorithms for
automatic spot finding are quite well developed (Geman and Geman, 1984;
Lehrach et al., 1997). At the current stage nearly 80 % of all positive clones
can be scored automatically.
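
One way the duplicate spotting pattern can support automatic scoring is sketched below. This is an assumed decision rule, not the algorithm used by the authors: a clone is called positive only when both duplicate spots exceed a threshold, and a clone with a single bright spot is set aside for manual inspection. The function name `score_clones`, the threshold factor of 3 and the intensity values are hypothetical.

```python
def score_clones(duplicates, background, factor=3.0):
    """Classify clones from the intensities of their two duplicate spots.

    duplicates: dict mapping clone id -> (intensity_1, intensity_2).
    background: local background intensity of the membrane image.
    factor: how many times brighter than background a spot must be
            to count as a signal (assumed value).
    """
    threshold = background * factor
    labels = ("negative", "ambiguous", "positive")
    calls = {}
    for clone_id, (first, second) in duplicates.items():
        # Count how many of the two duplicate spots exceed the threshold.
        spots_above = int(first > threshold) + int(second > threshold)
        calls[clone_id] = labels[spots_above]
    return calls


calls = score_clones(
    {"clone_a": (900, 850), "clone_b": (120, 90), "clone_c": (700, 110)},
    background=100,
)
```

In this example `clone_c` would be passed to a human operator, which is one plausible reason why only part of the positive clones can be scored without manual review.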
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2.3 New technologies in high throughput screening: 
Miniaturisation is a driving force

Integrated circuits made possible the personal computers that have revolutionised
the world. In addition, the semiconductor industry has been able to
double the complexity of a chip every 5 years while reducing cost.

We now witness the development of modern chip-biology, which adopts 
methods and technologies from the semiconductor industry. High density 
arrays with integrated solid-phase oligonucleotide synthesis for rapid multi- 
plex analysis of nucleic acid samples have been introduced (Chee et al., 1996; 
Hacia et al., 1996; Schena et al., 1996). Microlithographic etching of silicon 
wafers enables the creation of precisely controlled structures, for example 
"obstacle courses", which might be a good substitution of common gels for 
separation of long polymer molecules (Volkmuth et al., 1995). Over the last 
few years, miniaturisation became a driving force in molecular biology and 
genome analysis. 

High throughput screening methods of clone libraries would benefit from 
further automation and miniaturisation as well. For example, increasing the 
density of arrays and therefore the number of DNA targets would increase 
the analysis speed and the information flow. As an alternative to conventional 
arraying with pins, a microdrop spotting-on-demand technology was developed.
With the aim of reducing the size of hybridisation arrays by one or two
orders of magnitude, the genetic samples are pipetted with a piezoelectric
multi-channel microdispensing robot. The piezoelectric dispensing system
was originally developed for use in ink-jet printers. The major part is a piezoelectric
element or piezoelectric actuator in tube shape, which expands and



a tapered glass capillary with an outlet nozzle size of 25 µm to 50 µm. The
liquid that has to be dispensed can be filled into the glass capillary in two ways.
The first possibility is to aspirate the liquid through the dispensing nozzle by
applying a gentle vacuum. The second one is to fill the capillary from a reservoir
in the back. Once a glass capillary is filled with several microliters, liquid
droplets can be shot out by applying alternating voltage. The piezo actuator
expands and then contracts so that a droplet is fired out of the glass capillary
(Fig. 3). The microdrop system is able to dispense 25 µm to 100 µm droplets
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to pressure unit (-15 mbar) 



[Figure 3 labels: piezo-drop generator (pulse voltage approx. 110 V, pulse duration 41 µs); glass capillary; nozzle (diameter ~52 µm); target: nanowell plate, membrane, slide or MALDI-MS target]



Figure 3 Scheme of a Piezo-Jet dispenser. A minimum of 1 µl of liquid has to
be aspirated. Application of the right voltage together with the right pulse shape
to the piezoelectric element results in a drop with 60 µm diameter, corresponding
to 100 pl. The drop velocity is approx. 0.58 m/s
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(10 pi to 100 pi). The frequency of a commercially available single nozzle 
system is approximately 2000 drops per second. The droplet shape is influenc- 
ed by the diameter of the capillary and the cleanliness of the nozzle edge. 
Small crystals or a soiled tip surface result in poorly shaped droplets or no 
droplets at all. 
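
The relation between droplet diameter and volume can be checked with a short calculation. The sketch below assumes that a droplet in flight is a perfect sphere, which is an idealisation; the function name is hypothetical, and the 60 µm diameter, the roughly 100 pl drop volume, the 5 nl fill volume and the 2000 drops per second rate are the values given in this chapter.

```python
import math


def droplet_volume_pl(diameter_um):
    """Volume in picolitres of a spherical droplet of the given diameter in µm."""
    volume_um3 = math.pi / 6.0 * diameter_um ** 3
    return volume_um3 / 1000.0   # 1 pl = 1000 µm^3


volume_60um = droplet_volume_pl(60.0)   # about 113 pl, close to the nominal 100 pl

# Drops and time needed to deliver 5 nl with nominal 100 pl drops at 2000 drops/s.
drops_for_5nl = round(5000.0 / 100.0)
seconds_for_5nl = drops_for_5nl / 2000.0

print(round(volume_60um, 1), drops_for_5nl, seconds_for_5nl)
```

Because volume scales with the cube of the diameter, small changes in droplet diameter have a large effect on the dispensed volume.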

An eight-nozzle head has been recently implemented. It moves in the x, y,
z positions with 5 µm resolution using a servo-controlled, linear drive system.
A 9-mm spacing between nozzles enables aspiration of solutions directly from
the wells of a 384-well or 96-well microtitre plate. After aspirating the sam- 



is formed or not. The camera captures an image of a droplet in a stroboscopic
light. An integrated image analysis system scans the image to verify the quality of
the droplets. If a droplet is poorly formed, the image analysis system directs
the head to clean the edge of the nozzle. The piezo-dispensing parameters,
e.g. the voltage and impulse length for each of the nozzles, are independently
controlled (Fig. 4).

With the system described here it is also possible to perform precise filling 
procedures for nanotiter plates or silicon wafers. Therefore, a second video 
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Figure 4 Automatic drop control. For quality control of drops a stroboscopic image is acquired, (a) Shows 
the stroboscopic light and the microscope objective connected to the CCD camera. The eight dispensing 
capillaries can be inspected and adjusted. The drop parameters can be adjusted online (b) using user-friendly 
buttons for voltage, pulselength and stroboscope delay 



camera is attached to the dispensing head. Via software, it makes it possible to identify
certain cavities on a silicon surface and to dispense a chosen number of
droplets independently into each of the chosen cavities (Fig. 5).

The spot size of a microdrop system on a nylon membrane varies between
50 µm and 120 µm and the array density is approximately 2000 spots/cm^2. The



3 min to array 100 x 100 spots in a square with dot size of 100 µm diameter and
230 µm distance between centres (Fig. 6). At this density it is possible to
immobilise a small cDNA library consisting of 14000 clones on a microscope-slide
surface.
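
The quoted density follows directly from the centre-to-centre distance. The sketch below assumes a square grid; the function name is hypothetical, and the 230 µm pitch, the 100 x 100 array arrayed in 3 min and the 14000-clone library are the figures from the text.

```python
def spots_per_cm2(pitch_um):
    """Spots per square centimetre for a square grid with the given pitch in µm."""
    spots_per_cm = 10_000.0 / pitch_um   # 1 cm = 10000 µm
    return spots_per_cm ** 2


density = spots_per_cm2(230.0)            # about 1890 spots/cm^2

# Area needed for a 14000-clone library at this density, one spot per clone.
area_needed_cm2 = 14_000 / density        # about 7.4 cm^2

# Average spotting rate for a 100 x 100 array completed in 3 minutes.
spots_per_second = (100 * 100) / (3 * 60)

print(round(density), round(area_needed_cm2, 1), round(spots_per_second, 1))
```

A pitch of 230 µm therefore gives a little under 2000 spots/cm^2, in line with the approximate density stated above.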

Another application of the microdispensing technology is the preparation 
of probe plates for MALDI (Matrix Assisted Laser Desorption Ionisation)- 
Mass-spectrometry, which has the potential for high throughput applications 
in DNA analysis. A commercially available MS instrument has a potential to 
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Figure 5 Piezo Drop Arraying. The picture shows a density of 2150 clones per 
square centimeter arrayed by the microdispensing device. The spots are 100 µm
in diameter and the spacing between the spots is 230 µm. Every spot contains
the same DNA fragment. The picture has been taken with a laser scanning device
developed at the Max Planck Institute for Molecular Genetics in Berlin

analyse one probe in a few milliseconds. We have developed a prototype that
uses microdispensed arrays of several thousand DNA fragments or proteins 
on a MALDI-MS target with a size of approx. 2 x 2 cm. 

Unfortunately, there are some drawbacks limiting the use of MALDI-MS 
in genome analysis. For proteins and genome projects the probes have to be 

reagents, drastically increase the background during the measuring. The
background noise overlays the overall spectrum, making it very often impossible
to interpret the obtained peaks. Other limitations include the mass
range of the investigated species. At the current stage in MALDI-MS DNA
sequencing a maximum of 80 bp can be resolved at a one basepair level
(Murray, 1996; Kirpekar, Nordhoff et al., 1995).

With the ongoing miniaturisation process in genome analysis, new tools 
have to be developed for all necessary handling steps in, e.g. a miniaturised 
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Figure 6 Filling of silicon wafer cavities. Each cavity was anisotropically etched to a size of 0.5 mm x 
0.5 mm. The picture shows a row. Five cavities on the lefthand side were filled with 5 nl (50 drops each
containing 100 pl) of a fluorescent dye solution (Cy5, Amersham), one cavity was not filled and two cavities
on the righthand side were filled with a 10-fold lower concentrated solution of the dye. The total amount of
Cy5 in the cavities varies between 1 femtomol (bright signals) and 100 attomoles (weak signals). Picture has
been taken with the same detection device as in Figure 5



hybridisation approach. In addition, the detection systems have to be improved.
Smaller spot sizes result in smaller amounts of target, and sensitive detection
systems are required with an increased spatial resolution, e.g. using light-optical
principles. Optical methods like laser scanning devices for large areas
or different microscopy methods including confocal laser scanning microscopes
for areas smaller than a few mm^2 are the methods of choice in the near
future. For the analysis of single molecules in small cavities other optical
methods like scanning nearfield optical microscopy (SNOM) (Moers et al.,
1996; Iwabuchi et al., 1997) or fluorescence correlation spectroscopy (FCS)
(Rigler, 1995; Oehlenschlager et al., 1996) as well as non optical methods like
atomic force microscopy (AFM) (Hansma et al., 1996; Lyubchenko and
[text missing in source]

2.4 Conclusions

In the future there will be more and more "Lab-On-A-Chip" devices (Service,
1995), powerful integrations of microfluidics, micromechanics and detection
systems (Kovacs et al., 1996). Some application of these devices could be in
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the field of faster diagnostics. The microsystems- might also- include- oligo- ■ 
nucleotide arrays on different surfaces for DNA diagnostics (Mirzabekov, 
1994). There are promising examples of PCR integration on silicon wafers 
with online detection and/or analysis (Woolley et al., 1996; Taylor et al., 1997).
These technologies can screen many DNA samples more cheaply, faster and in a highly
parallel fashion. In addition, reducing the size of the processes can provide a new
experimental design (Burke et al., 1997), e.g. different techniques of liquid 
handling like electroosmotic pumps (Freaney et al., 1997) for handling of tiny 
reaction volumes. 

For these scenarios to become reality soon, molecular biologists must work 
more closely with engineers, physicists and chemists to remove the gap be- 
tween the macroscopic "laboratory world" and the microsystems "chip 
world".
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ABSTRACT 


This report compares several types of liquid handling equipment presently used in HTS. The devices include 96- 
well automated pipettors such as the Carl Creative PlateTrac™ (Harbor City, CA), Matrix PlateMate™ (Hudson, 
NH), Tomtec Quadra-96™ (Hamden, CT) and a Zymark RapidPlate-96™ (Hopkinton, MA) integrated into a full 
robotic system. A general set of considerations including ease of programming, assay-completion time, accuracy 
and precision of liquid dispensing, and low-volume pipetting were evaluated. Both a protease screen and a cell- 
based reporter gene assay were used as model systems for comparison. The data indicate that the Carl Creative 
PlateTrac has an advantage in several areas. These include the ease in programming, reduction in assay run time, 
and increased accuracy and precision in liquid dispensing, especially for volumes of 1 µl or less. However, both the
Matrix PlateMate and Zymark robotic systems may be used to perform complicated multi-step tasks involving multidirectional
plate transfer, which is not possible on the current PlateTrac. Advantages and limitations of each piece
of equipment are discussed further in this report.



INTRODUCTION

[text missing in source] molecules. Some key considerations in efficient HTS automation
are (1) the speed at which the system can process plates, (2)
the accuracy and precision of the liquid handling, (3) the ease
of equipment programming and operation, and (4) the ability
to pipet small volumes (<1 µl). Although the majority of the
equipment used today in HTS allows simultaneous 96-well
pipetting, the increase in compound library size provides a challenge
to develop and use the most efficient automation for
screening.

The main criteria for analyzing equipment efficiency in this
study were assay programming time, assay run time, minimum
dispense volume, and accuracy and precision of pipetting
of both organic and aqueous solutions. Dimethylsulfoxide
(DMSO) and water were selected as typical examples of each
solution type. The rationale for including programming time in
the equipment comparison criteria is that for biological assays
that require multiple steps, instrument programming can be time
consuming and require highly trained personnel. Plate manipulation
and reagent addition time are the key limiting factors
in the number of plates that can be processed in a given assay
run. Higher throughputs can be achieved with more efficient
[text missing in source] intra- and interplate variation. In addition, the ability to accurately
transfer small volumes of test compound decreases the
number of pipetting steps because compound dilution steps are
minimized or eliminated entirely. For example, in cell-based
assays where sensitivity to organic solvents such as DMSO is
a limiting factor, the capability to transfer 1 µl or less of compound
in neat DMSO to the assay is essential.

For further equipment validation and comparison, two commonly
used HTS assays, a protease assay and whole cell-based
reporter gene assay, were chosen. The protease assay selected
is well suited for evaluation of the criteria noted above for several
reasons. The assay is simple and easily adaptable to HTS
and the automation evaluated in this article. The enzyme and
related reagents remain stable for the length of the study. Finally,
it uses fluorescence as the detection method, which is a
well-established signal type that remains linear with concentration.
In contrast, cell-based assays, especially transcriptional
assays using reporter gene based systems, provide a unique
challenge in adaptation to the HTS format.1,2 Accurate and ef-
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ficient automation can be the key to whether a cell-based as- 
say is scalable for true HTS. The cell-based assay selected uti- 
lizes a luciferase reporter system, which is highly sensitive and 
generates a signal that is stable for 5 hours after substrate 
(LucLite™, Packard Instrument Company, Meriden, CT) addition.



MATERIALS AND METHODS 

Accuracy and precision 

The pipetting accuracy was evaluated by pipetting a given volume in the range of 0.5 to 100 µl into each well of a 96-well plate and measuring the change in mass of the entire plate. Ultimately, the change in plate mass in conjunction with the known pipetting precision yields a measure of pipetting accuracy. The precision of liquid handling for each instrument was determined by measuring the coefficient of variability (%CV) across a five-plate set. More specifically, the well-to-well precision was determined by pipetting 7-methoxycoumarin-4-yl acetic acid (MCA) (BACHEM, Torrance, CA), either in water or neat DMSO, into Packard black HTRF plates (Packard Instrument Company, Meriden, CT). Plates were then read on a Cytofluor series 4000 fluorometer (PerSeptive Biosystems, Framingham, MA). Source plates were made with stock concentrations of MCA of 1 to 200 µM in 200 µl of distilled water or neat DMSO. The test plates were then prepared by transferring the desired volume from the source plate directly into each of five separate test plates. The same set of pipette tips was used for each five-plate set at each volume. The final volume in the test plates was brought to 100 µl with distilled water or DMSO (consistent with the solution type transferred) to yield a final concentration of 1 µM MCA, and the plates were then read at Ex 313 nm, Em 395 nm.
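
As a rough illustration of the two calculations described above, the following Python sketch converts a plate mass change into an average volume per well and computes a %CV from a set of well readings. The function names, the density values (approximately 0.998 g/ml for water and 1.10 g/ml for DMSO near room temperature), and the example numbers are assumptions chosen for illustration; they are not measurements from this study.

```python
from statistics import mean, stdev

# Approximate room-temperature densities in g/ml (assumed values, not from the study).
DENSITY_G_PER_ML = {"water": 0.998, "DMSO": 1.10}


def mean_volume_per_well_ul(mass_change_g, liquid, wells=96):
    """Average dispensed volume per well (µl) from the plate mass change (g)."""
    total_ml = mass_change_g / DENSITY_G_PER_ML[liquid]
    return total_ml / wells * 1000.0  # convert ml per well to µl per well


def percent_cv(readings):
    """Coefficient of variation: sample standard deviation / mean, in percent."""
    return stdev(readings) / mean(readings) * 100.0


# Hypothetical example: a plate gained 0.0958 g after a nominal 1-µl water dispense.
volume = mean_volume_per_well_ul(0.0958, "water")
# Made-up fluorescence readings standing in for well-to-well signal.
cv = percent_cv([100.0, 102.0, 98.0, 101.0, 99.0])
print(f"{volume:.2f} µl per well, CV = {cv:.1f}%")
```

The gravimetric value reflects only the plate-average volume, which is why the well-to-well fluorescence CV is needed as a separate measure of precision.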

The pipetting protocols used for each instrument were similar. The manufacturer-designed tips were used for each instrument [...] Plate-96 an air gap was first pulled, the actual liquid volume was aspirated, and the entire volume aspirated was then dispensed. On the Carl Creative PlateTrac, a 1-µl liquid predispense and 1-µl liquid carry volume were used. The pipette tip touch to the plate was variable, depending on the volume transferred. At 0.5 and 1 µl, the tips were programmed to touch the well bottom for all instruments. For the larger volumes, the liquid transfer was performed at user-programmed heights above the well bottom. A tip touch was not used after dispense.

Matrix metalloprotease/MCA assay

The intra- and interinstrument reproducibility for all four instruments was evaluated using a matrix metalloprotease (MMP) assay as follows. Plates were set up with eight control wells, an eight-point titration curve of a known inhibitor (in duplicate), and 72 wells of DMSO as an indicator of variation across the plate. Assay plates were first prepared by pipetting 1 µl of sample at 100 times the desired final concentration into dry Packard black HTRF plates. A quantity of 84 µl of MCA peptide substrate (10 µM final concentration) in 1X reaction buffer (50 mM TRICINE, 10 mM CaCl2, 0.002% NaN3, pH 7.5) was then added to each well. The MCA peptide has the sequence Mca-Pro-Leu-Gly-Leu-Dpa-Ala-Arg-NH2 (Mca = 7-methoxycoumarin acetic acid; Dpa = 3-(2',4'-dinitrophenyl)-2,3-diaminopropionic acid). Finally, 15 µl of MMP enzyme (12.5 nM final concentration) was added to each well for a total assay volume of 100 µl. The enzyme was thawed just prior to use and diluted to 83 nM with enzyme buffer (50 mM TRICINE, 0.05% BRIJ-35, 400 mM NaCl, 10 mM CaCl2, 0.02% NaN3, pH 7.5). Plates were incubated for 90 min at 25°C on an orbital shaker. After incubation, 20 µl/well EDTA (500 nM) was added to quench the reaction and the plates were read on a Cytofluor 4000 series fluorometer at Ex 313 nm, Em 395 nm.
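
The volumes and concentrations quoted above can be cross-checked with simple dilution arithmetic. The short Python sketch below does this; the helper name `final_concentration` and the variable names are hypothetical, and the substrate stock concentration it derives is only what the stated volumes imply, not a value reported by the authors.

```python
def final_concentration(stock_conc, added_volume_ul, total_volume_ul):
    """Concentration after diluting `added_volume_ul` of stock into the total volume."""
    return stock_conc * added_volume_ul / total_volume_ul


# Per-well additions stated in the protocol (µl).
SAMPLE_UL, SUBSTRATE_UL, ENZYME_UL = 1.0, 84.0, 15.0
total_ul = SAMPLE_UL + SUBSTRATE_UL + ENZYME_UL

# 83 nM enzyme stock, 15 µl per well.
enzyme_final_nm = final_concentration(83.0, ENZYME_UL, total_ul)

# Fold dilution of the 1-µl sample in the final assay volume.
sample_dilution = total_ul / SAMPLE_UL

# Substrate stock implied by a 10 µM final concentration (derived, not reported).
substrate_stock_um = 10.0 * total_ul / SUBSTRATE_UL

print(f"total volume: {total_ul:.0f} µl")
print(f"enzyme final: {enzyme_final_nm:.2f} nM")
print(f"sample dilution: {sample_dilution:.0f}-fold")
print(f"implied substrate stock: {substrate_stock_um:.1f} µM")
```

The computed enzyme concentration agrees with the stated 12.5 nM to within rounding, and the 1-µl sample is diluted 100-fold, consistent with dosing at 100 times the final concentration.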

Cell-based reporter gene assay

The PlateTrac and the Zymark integrated robotic system 
were selected for comparison of reproducibility using a cell- 
based luciferase reporter gene assay. The PlateTrac can handle 
all aspects of this assay except pipetting of cells. The cells used 
in the assay would not remain in uniform suspension through- 
out the run, so the cells were instead added using a Titertek 
multidrop (Titertek Instruments, Inc., Huntsville, AL). Al- 
though the Zymark robotic system can handle all aspects of the 
cell-based assay, in the interest of efficiency the choice was 
made to use the PlateTrac for compound preparation. In addi- 
tion, the assay plates were incubated in the absence of CO2 in 
the Zymark run because frequent incubator door access yielded 
inconsistent CO2 exposure throughout the assay run. 

The PlateTrac device was used to prepare all assay plates with the test compounds. For a cell-based assay it is critical to maintain consistent DMSO levels due to the sensitivity of mammalian cells to organic solvents. From each compound source plate, 1.25 µl of compound per well was pipetted directly into the appropriate assay plate at 100X final concentration. For screening using the PlateTrac, black Packard HTRF plates were [...]. For screening using the Zymark integrated robotic system, white Packard Optiplates were used in conjunction with the Packard LumiCount, already integrated into the Zymark system, in order to maximize signal.

Two stably transfected Jurkat T-cell lines were used in this assay: an inducible expressor of the gene of interest, used to determine specific activity of the tested compounds, and a constitutive expressor used to determine cell suppression/toxicity. These cells were routinely maintained in RPMI 1640 (Gibco BRL, Gaithersburg, MD) with 5% FBS (Hyclone), 1X nonessential amino acids (Gibco BRL), L-glutamine, phenol red, and 3 mg/ml Geneticin (Gibco BRL) as a selection agent to ensure continued expression of their construct. Cell preparation techniques were identical for both clones. On the day of the assay, the cells were prepared in RPMI 1640 without phenol red (Gibco BRL), 5% FBS, 1X nonessential amino acids, 200 µM HEPES, and 0.1% Gentamycin (Gibco BRL), and resuspended to a concentration of 8 × 10^5 cells/ml to give 50,000 cells/well final concentration in the assay. Media without phenol red must be used in the assay system because phenol red interferes with the luminescence readout. Addition of HEPES ensured consistent pH throughout the assay run.

For the assay on the PlateTrac, 61 µl of the media with or without activators was added to each well of a compound plate. Using the Titertek Multidrop dispenser, 65 µl of cell solution was pipetted into each well of all plates. The plates were incubated at 37°C in a humidified incubator (+5% CO2) for 5 hr. Using the PlateTrac system for reagent addition, 125 µl of LucLite™ reagent was then added to each well of all the plates. The use of LucLite allowed measurement of luminescence as a glow reaction with a stable signal duration of 5 hr. The plates were read for luminescence on the Packard TopCount™.
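
The per-well quantities in this protocol follow from the stated volumes and cell density. The Python sketch below works through them; the variable names are hypothetical and the calculation assumes the three additions (compound, media, cells) simply sum to the incubation volume.

```python
# Values stated in the protocol.
CELL_DENSITY_PER_ML = 8e5   # cells/ml after resuspension
CELL_VOLUME_UL = 65.0       # cell suspension added per well
COMPOUND_UL = 1.25          # compound predispensed per well
MEDIA_UL = 61.0             # media with or without activators

# Cells delivered per well (convert µl to ml).
cells_per_well = CELL_DENSITY_PER_ML * CELL_VOLUME_UL / 1000.0

# Incubation volume and the resulting fold dilution of the compound.
incubation_volume_ul = COMPOUND_UL + MEDIA_UL + CELL_VOLUME_UL
compound_dilution = incubation_volume_ul / COMPOUND_UL

print(f"cells per well: {cells_per_well:.0f}")
print(f"incubation volume: {incubation_volume_ul:.2f} µl")
print(f"compound dilution: {compound_dilution:.1f}-fold")
```

Under these assumptions the result is close to the nominal 50,000 cells/well and the roughly 100-fold compound dilution described in the text.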

For the Zymark automated system, all plates and reagents were prepared and placed in the proper locations prior to commencing the run. The Zymark robotic system then retrieved the assay plate (containing 1.25 µl compound) from the incubated carousel. A total of 61 µl media with or without activators was then added to the appropriate plate wells using the Reagent Addition Station (RAS) and the Zypettor. Cells were added in 65 µl to each well using the Zypettor pipetting station. The plate was then placed in a 37°C humidified cell culture incubator (CO2) for 5 hr. After incubation, 125 µl LucLite was added to the entire plate using the RAS. The plate was then incubated at room temperature with gentle shaking for 5 min. Plates were read for luminescence on the Packard LumiCount.



RESULTS AND DISCUSSION 

Assay adaptation to HTS 

In this study the Carl Creative PlateTrac, Matrix PlateMate, Tomtec Quadra-96, and Zymark integrated robotic system were compared. The design of the above equipment, except for the PlateTrac, and its use in HTS have been described elsewhere. 3-6 The PlateTrac design is based on a high-speed conveyor belt system. Two different PlateTrac systems were used in this study. The first system can handle one stack of 96-well plates and allows for plate washing, turbo drying with compressed air, and reagent addition by two separate dispense heads. The second PlateTrac system can handle two stacks of plates, either of which can be standard 96-well plates or deep-well 96-well plates. The system also contains a plate washer and a dispense head with tip wash.

Design differences of each system led to distinct program- 
ming requirements for each instrument. The PlateTrac with two 
dispense heads can run the protease assay, with both reagent 
additions, as one program. The Tomtec Quadra-96 and the Ma- 
trix PlateMate have only one dispense head, which required two 
programs to be written and run consecutively (or linked) to 
complete the assay. The Zymark RapidPlate-96 has only one 
pipetting head, but because it is only one component of an in- 
tegrated track-based system with a robotic arm, many reagent 
stations can be used to run an entire assay protocol as one pro- 
gram. 

In many cases a successful high throughput screen depends on the ability to transfer small volumes of liquid accurately and precisely from one plate to another. Therefore, the initial set of experiments compared the liquid transfer capability of each device in the range of 0.5 to 100 µl. All experiments were performed in 96-well microtiter plates. For comparison, the capability of each device to pipet both DMSO, which was used to mimic compound transfer, and distilled water, which was used to mimic assay reagent transfer, was tested (Tables 1 and 2). The data indicate that in the low-volume range the PlateTrac repeatedly gave the most accurate and precise values. All of the instruments consistently pipetted DMSO better than water. This is likely due to the differences in surface tension of the two liquids and their interaction with the plastic tips or the recipient plate. Overall, the data indicate that for DMSO solu-



Table 1. Pipetting Accuracy

Pipetting accuracy was determined by dispensing set volumes of either DMSO/MCA solution or water/MCA solution into preweighed plates. The net weight of fluid dispensed into each plate was recorded, divided by the density of either the water or DMSO at room temperature, divided by 96, and multiplied by 1000 to obtain average microliters per well.

a Programming limitations did not allow for programming volumes of less than 1 µl.

b Actual volume transferred after nominal volume settings were adjusted to yield values closest to desired volume.

Table 2. Automation Pipetting Precision (%CV)

Coefficient of variation (n = 480 wells)

2.9

2.3

Pipetting precision was determined by dispensing solutions of either DMSO/MCA or water/MCA into dry Packard black HTRF plates. The wells were then brought to a final volume of 100 µl with water using the same dispense head and plates were read on a Cytofluor series 4000 fluorometer at Ex 313 nm, Em 395 nm. The means, standard deviations, and %CVs (standard deviation divided by the mean and expressed as a percentage) were calculated for all rows and columns of the 96-well plate.

a Programming limitations did not allow for programming volumes of less than 1 µl.



tions, the PlateTrac also had the lowest CV across the entire liquid dispensing range. For aqueous solutions, the Zymark RapidPlate-96 (in most cases) yielded the lowest CVs. It is important to note that in the case of the Quadra-96 and RapidPlate-96, two values are shown for a given nominal pipetting volume with respect to accuracy of actual transferred volume (Table 1). All four instruments were first programmed with the nominal volume desired to be pipetted and the results expressed accordingly. Volume settings were then programmed to yield more accurate results (i.e., closer to the desired volume) when necessary. The resultant actual volumes transferred are shown in bold for Table 1. The disparity between nominal and real [...] an automated pipetting inventory. The Matrix PlateMate used for the low volumes (10 µl and less) was a demonstration instrument brought in for this comparison, and the Tomtec Quadra-96 is the oldest of the four systems evaluated. These results stress the importance of determining the accuracy of an automated pipetting system when validating any assay. Because data quality is proportional to pipetting accuracy and precision, reagent properties vary greatly and can significantly impact results as shown in Table 1. At the higher volume of 100 µl, all of the instruments were very precise, with CVs approximately 3% or less. For all volumes tested, all instruments pipetted within the manufacturers' reported specifications.

The time required to program each instrument and the potential assay throughput were compared (Table 3). The data indicate that the PlateTrac was the fastest to program and also had the quickest plate-handling cycle time. The fast programming time was due to the touch screen program control, which allowed rapid program creation in conjunction with the validation for plate heights, speed controls, and working volume range. The disadvantage of this type of programming was that only relatively simple programs could be entered. For example, sequential applications, such as dilution and transfer, must be written as two separate programs. The Tomtec Quadra-96 and the Matrix PlateMate were comparable in programming and plate cycle time. The Zymark System required longer initial programming time, but once set up, it was fully automated. The other three instruments are workstations that must be attended to when stacker capacities have been reached. The PlateTrac allowed the fastest plate handling of 0.5 min for plate destack, reagent aspiration and dispense, and plate restacking, as compared with over 2 min for the same process for the other instruments. The very fast plate cycle time for liquid handling was made possible by the conveyor belt design, allowing separate modules of the machine to be working at the same [...] a stable reagent is desired to extend the total throughput of a process.
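
To make the throughput implication of the cycle times concrete, the Python sketch below converts the liquid handling cycle times reported in Table 3 into plates and wells per hour. The function names are hypothetical, and the figures are an upper bound under the assumption of continuous operation: they ignore stacker reloading, incubation steps, and plate reading.

```python
# Liquid handling cycle times per plate (min), as reported in Table 3.
CYCLE_MIN_PER_PLATE = {
    "CCWS PlateTrac": 0.5,
    "Matrix PlateMate": 2.3,
    "Tomtec Quadra-96": 2.45,
    "Zymark RapidPlate-96": 2.67,
}


def plates_per_hour(cycle_min):
    """Upper-bound plate throughput assuming uninterrupted operation."""
    return 60.0 / cycle_min


def wells_per_hour(cycle_min, wells_per_plate=96):
    """Upper-bound well throughput for a given plate format."""
    return plates_per_hour(cycle_min) * wells_per_plate


fastest = min(CYCLE_MIN_PER_PLATE, key=CYCLE_MIN_PER_PLATE.get)
for name, cycle in CYCLE_MIN_PER_PLATE.items():
    print(f"{name}: {plates_per_hour(cycle):.0f} plates/h, "
          f"{wells_per_hour(cycle):.0f} wells/h")
print(f"fastest liquid handling: {fastest}")
```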



Table 3. MMP-3 Approximate Programming and Liquid Handling Cycle Times

Instrument              Programming time   LH cycle time/plate
CCWS PlateTrac          10 min             0.5 min
Matrix PlateMate        35 min             2.3 min
Tomtec Quadra-96        25 min             2.45 min
Zymark RapidPlate-96    45 min a           2.67 min

Programming times were determined for writing the program to perform all required steps on that instrument, and running the program validation for speed settings and plate heights. Times take into account user familiarity with instrument programming. Liquid handling (LH) cycle times include time to process plate through program and loading of fresh tips where required.

a Denotes entire assay programming time as opposed to solely liquid handling programming time.

Table 4. Comparison of IC50 Values of a Known MMP-3 Inhibitor on Each Instrument

IC50 (nM)

           PlateTrac   Matrix PlateMate   Tomtec Quadra-96   Zymark RapidPlate-96
Plate 1    11.85       11.92              15.07              11.58
Plate 2    11.95       12.05              13.76              11.61
Plate 3    11.92       12.03              14.24              12.93
Plate 4    11.93       11.99              14.15              11.78
Plate 5    11.92       11.91              15.79              11.11
Average    11.91       11.98              14.60              11.80

The IC50 of a known MMP-3 inhibitor was determined for each plate in a five-plate set on each instrument. Each IC50 value represents the mean of duplicate well values in the IC50 titrations for that plate. Assay was run in 100 µl final volume.
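
The averages in Table 4 can be reproduced, and the plate-to-plate spread quantified, with a few lines of Python. In the sketch below the per-plate IC50 values are transcribed from Table 4; the %CV across plates is an additional summary that is not reported in the table, and the dictionary and variable names are hypothetical.

```python
from statistics import mean, stdev

# Per-plate IC50 values (nM) transcribed from Table 4, plates 1 to 5.
IC50_NM = {
    "PlateTrac": [11.85, 11.95, 11.92, 11.93, 11.92],
    "Matrix PlateMate": [11.92, 12.05, 12.03, 11.99, 11.91],
    "Tomtec Quadra-96": [15.07, 13.76, 14.24, 14.15, 15.79],
    "Zymark RapidPlate-96": [11.58, 11.61, 12.93, 11.78, 11.11],
}

# For each instrument: (mean IC50, %CV across the five plates).
summary = {
    name: (mean(values), stdev(values) / mean(values) * 100.0)
    for name, values in IC50_NM.items()
}

for name, (avg, cv) in summary.items():
    print(f"{name}: mean IC50 = {avg:.2f} nM, interplate CV = {cv:.1f}%")
```

The computed means match the Average row of Table 4 to two decimal places.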



Automation of protease assay in HTS 

MMPs are widely involved in extracellular matrix remodel- 
ing and therefore comprise an attractive model for various 
pathologic processes, such as morphogenesis, angiogenesis, and 
metastasis. 7,8 Due to the wide utility, ease, and number of ex- 
amples of protease-based assays, an MMP assay was chosen 
for use as a model system to test the capabilities of the devices 
mentioned above. In this assay a donor-quencher fluorogenic 
peptidyl substrate was used. The basis of the assay was the re- 
lease of the Dpa quench, as a result of the cleavage of the Leu- 
Dpa bond, generating a fluorescent signal. The specific MMP 
inhibitor was selected from an in-house compound library. 

Five inhibitor titration plate replicates were screened on the PlateTrac, PlateMate, Quadra-96, and Zymark integrated robotic system. The IC50 of a known inhibitor (Table 4) and DMSO to mimic random compound distribution (Table 5) were used to determine the intra- and interplate variability. These aspects are critical in the subsequent statistical analysis to identify potentially active compounds. The data indicate that the PlateTrac and the RapidPlate-96 on the Zymark system had the lowest variability both within a single plate and across a five-plate set.



Automation of cell-based assay in HTS

The final set of experiments compared the use of the PlateTrac and the Zymark integrated robotic system in a cell-based assay. The Quadra-96 and the Matrix PlateMate were not included in this set of experiments because the initial data supported the Carl Creative PlateTrac as the most efficient off-line workstation to compare to a fully integrated robotic system. In this study, a luciferase reporter gene fused to four consensus sequence copies of the transcriptional binding site of the gene of interest was used. Two cell lines were used (inducible and constitutive) to allow for the distinction of compounds that displayed true inhibition of the expression of the gene of interest, versus general suppression/toxicity by the compound. The use of luciferase comprised a sensitive and convenient read-out system and was previously validated in the HTS format. 9

To compare both automation routes, ten random compound plates from an in-house chemical library were screened against both cell lines. All plates also included the following controls: nonactivated cells, activated cells, and wells containing both activated cells and a specific inhibitor. All of the compound activity values were normalized for the activated cell control response and expressed as percentage inhibition. Percentage CVs were similar for both instrument systems and were in the range of 9% to 18% for the PlateTrac and 11% to 21% for the Zymark. However, for the Zymark activity run, the average compound inhibition was 33%, as compared with 0% for the PlateTrac data set (Fig. 1). It is unclear what factors may have contributed to this shift, but activity cutoffs were determined consistently for both data sets. The toxicity assay data distributions were similar for both systems (Fig. 1). The discrepancy between the [...] may be due to the difference in the mechanism of luciferase expression for the inducible versus constitutive cell lines. The cell lines also respond to stress at different levels (data not shown), and the lack of
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Table 5. MMP-3 Assay Intra- and Interplate Variability 



Average signal ± SD (%CV)

           CCWS PlateTrac       Matrix PlateMate     Tomtec Quadra-96      Zymark RapidPlate-96
Plate 1    159.2 ± 3.1 (1.9%)   122.9 ± 5.0 (4.1%)   184.9 ± 5.3 (2.8%)    220.6 ± 5.2 (2.4%)
Plate 2    161.3 ± 3.3 (2.0%)   116.8 ± 4.7 (4.1%)   200.0 ± 7.0 (3.5%)    233.4 ± 5.8 (2.5%)
Plate 3    165.6 ± 2.8 (1.7%)   116.5 ± 3.8 (3.8%)   211.5 ± 7.7 (3.6%)    239.6 ± 5.5 (2.3%)
Plate 4    162.5 ± 3.1 (1.9%)   118.9 ± 3.9 (3.3%)   218.3 ± 10.3 (4.7%)   236.3 ± 5.6 (2.4%)
Plate 5    163.8 ± 3.5 (2.1%)   119.8 ± 3.5 (2.9%)   222.9 ± 6.1 (2.7%)    229.4 ± 5.6 (2.5%)
Average    162.5 ± 3.2 (2.0%)   119.0 ± 4.2 (3.6%)   207.5 ± 7.3 (3.5%)    231.9 ± 5.5 (2.4%)

The well-to-well variation was determined within the plate by calculating the mean signal:background, standard deviation, and %CV of predotted DMSO (to mimic compound transfer) with substrate and enzyme (n = 72). The interplate variability was determined for each instrument across the five-plate set.

[Figure 1 panels: PlateTrac Activity Assay; PlateTrac Toxicity Assay; Zymark Activity Assay]

FIG. 1. Distribution across a ten-plate random compound set in the cell-based reporter gene assay for the PlateTrac and the Zymark system. The distribution is shown for all ten compound plates screened in each cell line for both the PlateTrac and the Zymark system. The activity data represent the inducible cell line assay and the toxicity data were generated using the constitutively expressing cell line.

CO2 in the Zymark run may put a higher level of stress on the inducible cell line.

The cutoff values for identifying active compounds in both the activity and toxicity assays were established by calculating the means and standard deviations of the sample wells. Compounds were considered to be active if they were found to yield an activity value greater than two standard deviations above the mean (95.5% confidence). The activity cutoff values are different for each assay on each instrument due to variations in the means and standard deviations determined for each plate set run. For the PlateTrac instrument, the activity cutoff, expressed as percentage inhibition, was 34%, and the toxicity cutoff was 31%. For the Zymark instrument the activity cutoff was 56% and the toxicity cutoff was 40%. The summary of the active compounds is presented in Table 6. From a total of ten compounds identified as active, six were in agreement for both instrument runs to be both active and not toxic. Of the four remaining compounds, two (compounds 6 and 7) were found to be active on the PlateTrac, borderline active on the Zymark, but not toxic for both instruments. Compound 3 was active but not toxic on the PlateTrac only. This same compound was weakly active, but toxic on the Zymark. Finally, compound 9 was inactive on the PlateTrac, active on the Zymark, and not toxic on both instruments. In most cases the percentage inhibition of the tested sample was higher for the Zymark run. The difference may be due to the absence of CO2 during the assay runs on the Zymark, as discussed above. Confirmatory testing is necessary to further determine activity and toxicity characteristics of these compounds.
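
The hit-calling rule described above (a value greater than the instrument-specific cutoff counts as a hit) can be sketched in Python using the percentage inhibition values and cutoffs reported in Table 6. The data structure and function names are hypothetical, and `None` marks the one value that is not legible in the source; this is an illustration of the rule, not the authors' analysis code.

```python
from typing import Optional

# Cutoffs (percentage inhibition) quoted in the text for each instrument.
CUTOFFS = {
    "PlateTrac": {"activity": 34, "toxicity": 31},
    "Zymark": {"activity": 56, "toxicity": 40},
}

# Table 6 values per compound:
# (activity PlateTrac, activity Zymark, toxicity PlateTrac, toxicity Zymark).
# None marks a value that is not legible in the source.
TABLE6 = {
    1: (86, 95, 0, None),
    2: (64, 58, 4, 0),
    3: (57, 40, 6, 60),
    4: (41, 61, 0, 11),
    5: (45, 82, 4, 22),
    6: (43, 54, 4, 14),
    7: (39, 55, 0, 19),
    8: (70, 100, 0, 0),
    9: (17, 89, 12, 0),
    10: (39, 68, 0, 2),
}


def exceeds(value: Optional[float], cutoff: float) -> Optional[bool]:
    """True if the value is above the cutoff; None if the value is unknown."""
    return None if value is None else value > cutoff


def active_on(instrument: str) -> list:
    """Compounds whose activity exceeds the activity cutoff on one instrument."""
    column = 0 if instrument == "PlateTrac" else 1
    cutoff = CUTOFFS[instrument]["activity"]
    return [c for c, row in TABLE6.items() if exceeds(row[column], cutoff)]


def toxic_on(instrument: str) -> list:
    """Compounds whose toxicity value exceeds the toxicity cutoff (unknowns skipped)."""
    column = 2 if instrument == "PlateTrac" else 3
    cutoff = CUTOFFS[instrument]["toxicity"]
    return [c for c, row in TABLE6.items() if exceeds(row[column], cutoff)]


active_both = sorted(set(active_on("PlateTrac")) & set(active_on("Zymark")))
print("active on both instruments:", active_both)
print("active on PlateTrac only:",
      sorted(set(active_on("PlateTrac")) - set(active_on("Zymark"))))
print("active on Zymark only:",
      sorted(set(active_on("Zymark")) - set(active_on("PlateTrac"))))
print("toxic on Zymark:", toxic_on("Zymark"))
```

Applying the strict greater-than rule places compounds 6 and 7 just under the Zymark activity cutoff, which corresponds to the "borderline active" description in the text.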



Table 6. [...] in the Cell-based Reporter Gene Assay on the CCWS and the Zymark System

% Inhibition

                 Activity assay          Toxicity assay
                 PlateTrac   Zymark      PlateTrac   Zymark
Compound 1       86          95          0           [...]
Compound 2       64          58          4           0
Compound 3       57          40          6           60
Compound 4       41          61          0           11
Compound 5       45          82          4           22
Compound 6       43          54          4           14
Compound 7       39          55          0           19
Compound 8       70          100         0           0
Compound 9       17          89          12          0
Compound 10      39          68          0           2
Cutoffs (±2σ)    34          56          31          40

The compound hits are recorded as percentage inhibition of control wells. The cutoff for a compound hit was determined as the average activity plus 2 SD. The cutoff criteria are different for each instrument due to variability of the plates across a ten-plate set.

CONCLUSION

The PlateTrac is the most accurate and precise instrument of the four tested when pipetting small volumes. This provides an invaluable advantage for compound transfer in any assay. In addition, the PlateTrac system requires less time for programming and has a significantly faster plate processing time when compared with the other instruments. Another advantage of the PlateTrac is that the pipetting heads can be easily interchanged between the 50-µl 96-well format and the 200-µl 96-well format. The Matrix PlateMate, Tomtec Quadra-96, and Zymark integrated robotic system have an advantage over the current PlateTrac in programming multi-step operations requiring multi-directional plate movement (newer versions of the PlateTrac allow multi-directional plate movement but were not available for testing). This is one of the reasons that the programming time is more extensive on the other equipment than for the PlateTrac. The Zymark integrated robotic system is the only instrument tested that has the advantage of full on-line assay automation without human intervention.

Intra- and interplate variability results for the protease assay showed all four instruments to perform well, with CVs under 5% and relatively good correlation between instruments. The IC50 data for the control compound in this assay were also very reproducible between instruments.

For the cell-based reporter gene assay, both the PlateTrac 
and the Zymark integrated robotic system provide an efficient 
design for HTS. However, each system has key limitations. The 
PlateTrac, with the instrument design tested here, is not effi- 
cient for cell suspension and transfer. The Zymark system is 
limited in the ability to quickly prepare compound test plates 
for an assay and, in the case of this cell-based assay, unable to 
provide consistent CO2 levels during the plate incubation. 

In summary, to automate a high throughput screen, there are many considerations to take into account. The first of these is the decision to use an integrated robotic system or to use a [...] entire assay. If using the workstation approach, one must then decide which instrument is most suitable for the assay. This report provides information on the advantages and limitations of three different workstations and one integrated robotic system to facilitate that decision.